Abstract

Community detection is a fundamental statistical problem in network data analysis. Many
algorithms have been proposed to tackle this problem. Most of these algorithms are not guaranteed
to achieve the statistical optimality of the problem, while procedures that achieve information
theoretic limits for general parameter spaces are not computationally tractable. In this
paper, we present a computationally feasible two-stage method that achieves optimal statistical
performance in misclassification proportion for the stochastic block model under weak regularity
conditions. The core of our two-stage procedure is a refinement stage motivated by penalized local
maximum likelihood estimation. This stage can take a wide range of weakly consistent
community detection procedures as its initializer; applied to the initial estimate, it outputs a
community assignment that achieves the optimal misclassification proportion with high probability.
The practical effectiveness of the new algorithm is demonstrated by competitive numerical results.

Keywords. Clustering, Community detection, Minimax rates, Network analysis, Spectral 
clustering. 


1 Introduction 

Network data analysis [71, 29] has become one of the leading topics in statistics. In fields such 
as physics, computer science, social science and biology, one observes a network among a large 
number of subjects of interest such as particles, computers, people, etc. The observed network can 
be modeled as an instance of a random graph and the goal is to infer structures of the underlying 
generating process. A structure of particular interest is community: there is a partition of the 
graph nodes in some suitable sense so that each node belongs to a community. Starting with 
the proposal of a series of methodologies [28, 55, 33, 40], we have seen a tremendous literature
devoted to algorithmic solutions to uncovering community structure, and great advances have also
been made in recent years on the theoretical understanding of the problem in terms of statistical
consistency and thresholds for detection and exact recovery. See, for instance, [11, 23, 77, 51, 53,
49, 2, 54, 31], among others. In spite of the great efforts exerted on this “community detection”
problem, its state-of-the-art solution has not yet reached a level of maturity comparable to what
statisticians have achieved in other high dimensional problems such as nonparametric estimation
[63, 38], high dimensional regression [12] and covariance matrix estimation [14]. In these more
well-established problems, not only do we know the fundamental statistical limits, we also have
computationally feasible algorithms to achieve them. The major goal of the present paper is to serve
as a step towards such maturity in network data analysis by proposing a computationally feasible
algorithm for community detection in the stochastic block model with provable statistical optimality.

To describe network data with community structure, we focus on the stochastic block model
(SBM) proposed by [34]. Let $A \in \{0,1\}^{n \times n}$ be the symmetric adjacency matrix of an undirected
random graph generated according to an SBM with $k$ communities. Then the diagonal entries of $A$
are all zeros and each $A_{uv} = A_{vu}$ for $u > v$ is an independent Bernoulli random variable with mean
$P_{uv} = B_{\sigma(u)\sigma(v)}$ for some symmetric connectivity matrix $B \in [0,1]^{k \times k}$ and some label function
$\sigma : [n] \to [k]$, where for any positive integer $m$, $[m] = \{1, \dots, m\}$. In other words, if the $u$th node
and the $v$th node belong to the $i$th and the $j$th community respectively, then $\sigma(u) = i$, $\sigma(v) = j$
and there is an edge connecting $u$ and $v$ with probability $B_{ij}$. Community detection then refers to
the problem of estimating the label function $\sigma$ subject to a permutation of the community labels
$\{1, \dots, k\}$. A natural loss function for such an estimation problem is the proportion of wrong labels
(subject to a permutation of the label set $[k]$), which we shall refer to as misclassification proportion
from here on.
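As a minimal sketch of the generative model just described, the following Python function draws an adjacency matrix from a label vector and a connectivity matrix. The function name `sample_sbm`, the 0-based labels, and the numerical values in the example are illustrative assumptions and are not taken from the paper.

```python
import random

def sample_sbm(labels, B, seed=0):
    """Draw a symmetric 0/1 adjacency matrix from an SBM.

    labels[u] is the community of node u (0-based here, 1-based in the text);
    B[i][j] is the connection probability between communities i and j.
    """
    rng = random.Random(seed)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u):  # only pairs with u > v; the diagonal stays zero
            p_uv = B[labels[u]][labels[v]]
            edge = 1 if rng.random() < p_uv else 0
            A[u][v] = edge
            A[v][u] = edge
    return A

# Illustrative example: n = 60 nodes, k = 2 communities of equal size.
B = [[0.5, 0.05], [0.05, 0.5]]
labels = [u % 2 for u in range(60)]
A = sample_sbm(labels, B, seed=1)
```

Each pair of nodes is visited once, so the entries below the diagonal are independent Bernoulli draws and the upper triangle is filled by symmetry.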

In groundbreaking works by Mossel et al. [51, 53] and Massoulie [49], the authors established
sharp thresholds for the regimes in which it is possible and impossible to achieve a misclassification
proportion strictly less than $\frac{1}{2}$ when $k = 2$ and both communities are of the same size (so that
it is better than random guess), which proved the conjecture in [23] that had previously been justified
only at a physics level of rigor. For some recent progress on the general case of fixed $k$ and possibly
unequal sized communities, see [1]. On the other hand, Abbe et al. [2], Mossel et al. [54] and Hajek et al. [31]
established the necessary and sufficient condition for ensuring zero misclassification proportion
(usually referred to as “strong consistency” in the literature) with high probability when $k = 2$ and
community sizes are equal, a result that was later generalized to a larger set of fixed $k$ by [32]. Arguably,
what is of more interest to statisticians is the intermediate regime between the above two cases,
namely when the misclassification proportion is vanishing as the number of nodes grows but not
exactly zero. This is usually called the regime of “weak consistency” in the network literature.

To achieve weak (and strong) consistency, statisticians have proposed various methods. One 
popular approach is spectral clustering [65] which is motivated by the observation that the rank 
of the $n \times n$ matrix $P = (P_{uv}) = (B_{\sigma(u)\sigma(v)})$ is at most $k$ and its leading eigenvectors contain
information of the community structure. The application of spectral clustering on network data 
goes back to [30, 50], and its performance under the stochastic block model has been investigated by 
[21, 59, 62, 25, 57, 39, 45, 67, 19, 37, 42], among others. To further improve the performance, various 
ways for refining spectral clustering have been proposed, such as those in [7, 54, 46, 72, 19] which 
lead to strong consistency or convergence rates that are exponential in signal-to-noise ratio, while 
[52] studied the problem of minimizing a non-vanishing misclassification proportion. However, in 
the regime of weak consistency, these refinement methods are not guaranteed to attain the optimal 
misclassification proportion to be introduced below. Another important line of research is devoted 
to the investigation of likelihood-based methods, which was initiated by [11] and later extended 
to more general settings by [77, 20]. To tackle the intractability of optimizing the likelihood 
function, an EM algorithm using pseudo-likelihood was proposed by [7]. Another way to overcome 
the intractability of the maximum likelihood estimator (MLE) is by convex relaxation. Various 
semi-definite relaxations were studied by [13, 18, 6], and the aforementioned sharp threshold for 
strong consistency in [31, 32] was indeed achieved by semi-definite programming. Recently, Zhang 
and Zhou [74] established the minimax risk for misclassification proportion in SBM under weak 
conditions, which is of the form 

$$\exp\left(-(1+o(1))\frac{nI^*}{k}\right) \qquad (1)$$

if all $k$ communities are of equal sizes, where $I^*$ is the minimum Renyi divergence of order $\frac{1}{2}$ [58]
of the within and the between community edge distributions. See Theorem 1 below for a more
general and precise statement of the minimax risk. Unfortunately, Zhang and Zhou [74] used the MLE
to achieve the risk in (1), which is hence computationally intractable. Moreover, none of the
spectral clustering based methods or tractable variants of the likelihood based methods has a known
error bound that matches (1) with the sharp constant $1 + o(1)$ in the exponent.

The main contribution of the current paper lies in the proposal of a computationally feasible 
algorithm that provably achieves the optimal misclassification proportion established in [74] adaptively
under weak regularity conditions. It covers the cases of both finite and diverging number
of communities and both equal and unequal community sizes and achieves both weak and strong 
consistency in the respective regimes. In addition, the algorithm is guaranteed to compute in 
polynomial time even when the number of communities diverges with the number of nodes. Since 
the error bound of the algorithm matches the optimal misclassification proportion in [74] under 
weak conditions, it achieves various existing detection boundaries in the literature. For instance, 
for any fixed number of communities, the procedure is weakly consistent under the necessary and 
sufficient condition of [51, 53], and strongly consistent under the necessary and sufficient condition 
of [2, 54, 31, 32]. Moreover, it could match the optimal misclassification proportion in [74] even
when k diverges. To the best of our knowledge, this is the first polynomial-time algorithm
that achieves minimax optimal performance. In other words, the proposed procedure enjoys both 
statistical and computational efficiency. 

The core of the algorithm is a refinement scheme for community detection motivated by penalized
maximum likelihood estimation. As long as there exists an initial estimator that satisfies a
certain weak consistency criterion, the refinement scheme is able to obtain an improved estimator
that achieves the optimal misclassification proportion in (1) with high probability. The key to
achieving the bound in (1) is to optimize the local penalized likelihood function for each node
separately. This local optimization step is completely data-driven and has a closed form solution, and
hence can be computed very efficiently. The additional penalty term is indispensable as it plays a
key role in ensuring the optimal performance when the community sizes are unequal and when the
within community and/or between community edge probabilities are unequal.

To obtain a qualified initial estimator, we show that both spectral clustering and its normalized 
variant could satisfy the desired condition needed for subsequent refinement, though the refinement 
scheme works for any other method satisfying a certain weak consistency condition. Note that 
spectral clustering can be considered as a global method, and hence our two-stage algorithm runs 
in a “from global to local” fashion. In essence, with high probability, the global stage pinpoints a
local neighborhood in which we shall search for the solution to each local penalized maximum likelihood
problem, and the subsequent local stage finds the desired solution. From this viewpoint, one can
also regard our approach as an “optimization after localization” procedure. Historically, this idea 
played a key role in the development of the renowned one-step efficient estimator [9, 43, 10]. It has 
also led to recent progress in non-convex optimization and localized gradient descent techniques 
for finding optimal solutions to high dimensional statistical problems. Examples include but are 
not limited to high-dimensional linear regression [76], sparse PCA [56, 47, 15, 70], sparse CCA 
[27], phase retrieval [16] and high dimensional EM algorithm [8, 69]. A closely related idea has 
also found success in the development of confidence intervals for regression coefficients in high 
dimensional linear regression. See, for instance, [75, 64, 36] and the references therein. Last but 
not least, even when viewed as a “spectral clustering plus refinement” procedure, our method 
distinguishes itself from other such methods in the literature by provably achieving the minimax 
optimal performance over a wide range of parameter configurations. 

The rest of the paper is organized as follows. Section 2 formally sets up the community detection 
problem and presents the two-stage algorithm. The theoretical guarantees for the proposed method 
are given in Section 3, followed by numerical results demonstrating its competitive performance on
both simulated and real datasets in Sections 4 and 5. A discussion on the results in the current
paper and possible directions for future investigation is included in Section 6. Section 7 presents 
the proofs of main results with some technical details deferred to the appendix. 

We close this section by introducing some notation. For a matrix $M = (M_{ij})$, we denote its
Frobenius norm by $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ and its operator norm by $\|M\|_{op} = \max_l \lambda_l(M)$, where
$\lambda_l(M)$ is its $l$th singular value. We use $M_{i*}$ to denote its $i$th row. The norm $\|\cdot\|$ is the usual
Euclidean norm for vectors. For a set $S$, $|S|$ denotes its cardinality. The notation $\mathbb{P}$ and $\mathbb{E}$ are
generic probability and expectation operators whose distribution is determined from the context.
For two positive sequences $\{x_n\}$ and $\{y_n\}$, $x_n \asymp y_n$ means $x_n/C \le y_n \le C x_n$ for some constant
$C \ge 1$ independent of $n$. Throughout the paper, unless otherwise noted, we use $C$, $c$ and their
variants to denote absolute constants, whose values may change from line to line.


2 Problem formulation and methodology 

In this section, we give a precise formulation of the community detection problem and present a 
new method for it. The method consists of two stages: initialization and refinement. We shall first 
introduce the second stage, which is the main algorithm of the paper. It clusters the network data 
by performing a node-wise penalized neighbor voting based on some initial community assignment. 
Then, we will discuss several candidates for the initialization step including a new greedy algorithm 
for clustering the leading eigenvectors of the adjacency matrix or of the graph Laplacian that is 
tailored specifically for stochastic block model. Theoretical guarantees for the algorithms introduced 
in the current section will be presented in Section 3. 


2.1 Community detection in stochastic block model 


Recall that a stochastic block model is completely characterized by a symmetric connectivity matrix
$B \in [0,1]^{k \times k}$ and a label vector $\sigma \in [k]^n$. One widely studied parameter space of SBM is

$$\begin{aligned}
\Theta_0(n,k,a,b,\beta) = \Big\{ (B,\sigma) :\ & \sigma : [n] \to [k],\ |\{u \in [n] : \sigma(u) = i\}| \in \Big[\frac{n}{\beta k} - 1,\ \frac{\beta n}{k} + 1\Big],\ \forall i \in [k], \\
& B = (B_{ij}) \in [0,1]^{k \times k},\ B_{ii} = \frac{a}{n} \text{ for all } i \text{ and } B_{ij} = \frac{b}{n} \text{ for all } i \ne j \Big\},
\end{aligned} \qquad (2)$$

where $\beta \ge 1$ is an absolute constant. This parameter space $\Theta_0(n,k,a,b,\beta)$ contains all SBMs in
which the within community connection probabilities are all equal to $\frac{a}{n}$ and the between community
connection probabilities are all equal to $\frac{b}{n}$. In the special case of $\beta = 1$, all communities are of
nearly equal sizes.
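To make the two requirements of this parameter space concrete, here is a small sketch that builds the connectivity matrix with $a/n$ on the diagonal and $b/n$ elsewhere, and checks the community-size constraint $n/(\beta k) - 1 \le n_i \le \beta n/k + 1$. The helper names and the 0-based labels are illustrative assumptions.

```python
from collections import Counter

def theta0_connectivity(k, a, b, n):
    """Connectivity matrix with a/n on the diagonal and b/n elsewhere."""
    return [[a / n if i == j else b / n for j in range(k)] for i in range(k)]

def sizes_within_bounds(labels, k, beta):
    """Check n/(beta*k) - 1 <= n_i <= beta*n/k + 1 for every community i."""
    n = len(labels)
    counts = Counter(labels)
    lower, upper = n / (beta * k) - 1, beta * n / k + 1
    return all(lower <= counts.get(i, 0) <= upper for i in range(k))
```

For example, with $n = 10$, $k = 2$ and $\beta = 1$ the allowed sizes are between 4 and 6, so a 9/1 split is rejected while a 5/5 split is accepted.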

Assuming equal within and equal between connection probabilities can be restrictive. Thus, we 
also introduce the following larger parameter space 


$$\begin{aligned}
\Theta(n,k,a,b,\lambda,\beta;\alpha) = \Big\{ (B,\sigma) :\ & \sigma : [n] \to [k],\ |\{u \in [n] : \sigma(u) = i\}| \in \Big[\frac{n}{\beta k} - 1,\ \frac{\beta n}{k} + 1\Big],\ \forall i \in [k], \\
& B = B^T = (B_{ij}) \in [0,1]^{k \times k},\ \frac{b}{\alpha n} \le \frac{1}{k(k-1)} \sum_{i \ne j} B_{ij} \le \max_{i \ne j} B_{ij} = \frac{b}{n}, \\
& \frac{a}{n} = \min_i B_{ii} \le \max_i B_{ii} \le \frac{\alpha a}{n},\ \lambda_k(P) \ge \lambda \text{ with } P = (P_{uv}) = (B_{\sigma(u)\sigma(v)}) \Big\}.
\end{aligned} \qquad (3)$$

Throughout the paper, we treat $\beta \ge 1$ and $\alpha \ge 1$ as absolute constants, while $k$, $a$, $b$ and $\lambda$ should
be viewed as functions of the number of nodes $n$ which can vary as $n$ grows. Moreover, we assume
$0 < \frac{b}{n} < \frac{a}{n} \le 1 - \epsilon$ throughout the paper for some numeric constant $\epsilon \in (0,1)$. Thus, the parameter
space $\Theta(n,k,a,b,\lambda,\beta;\alpha)$ requires that the within community connection probabilities are bounded
from below by $\frac{a}{n}$ and the connection probabilities between any two communities are bounded from
above by $\frac{b}{n}$. In addition, it requires that the sizes of different communities are comparable. In order
to guarantee that $\Theta(n,k,a,b,\lambda,\beta;\alpha)$ is a larger parameter space than $\Theta_0(n,k,a,b,\beta)$, we always
require $\lambda$ to be positive and sufficiently small such that

$$\Theta_0(n,k,a,b,\beta) \subset \Theta(n,k,a,b,\lambda,\beta;\alpha). \qquad (4)$$

According to Proposition 1 in the appendix, a sufficient condition for (4) is $\lambda \le$ […]. We assume
(4) throughout the rest of the paper.

The labels on the $n$ nodes induce a community structure $[n] = \cup_{i=1}^k \mathcal{C}_i$, where $\mathcal{C}_i = \{u \in [n] : \sigma(u) = i\}$
is the $i$th community with size $n_i = |\mathcal{C}_i|$. Our goal is to reconstruct this partition, or equivalently,
to estimate the label of each node modulo any permutation of label symbols. Therefore, a natural
error measure is the misclassification proportion defined as

$$\ell(\hat\sigma, \sigma) = \min_{\pi \in S_k} \frac{1}{n} \sum_{u \in [n]} 1_{\{\hat\sigma(u) \ne \pi(\sigma(u))\}}, \qquad (5)$$

where $S_k$ stands for the symmetric group on $[k]$ consisting of all permutations of $[k]$.
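The loss in (5) can be evaluated directly for small $k$ by trying every relabeling. The sketch below does this by brute force over all $k!$ permutations; the function name and the 0-based labels are illustrative assumptions, and for large $k$ one would replace the enumeration with an assignment solver.

```python
from itertools import permutations

def misclassification_proportion(sigma_hat, sigma, k):
    """Evaluate the loss in (5) by brute force over all k! label permutations.

    Labels are 0-based integers in range(k). Only practical for small k.
    """
    n = len(sigma)
    best = n
    for pi in permutations(range(k)):
        # pi relabels the estimate; count the nodes that still disagree.
        errors = sum(1 for s_hat, s in zip(sigma_hat, sigma) if pi[s_hat] != s)
        best = min(best, errors)
    return best / n
```

An estimate that differs from the truth only by swapping the two label names has loss zero, which is exactly the permutation invariance built into (5).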

2.2 Main algorithm 

We are now ready to present the main method of the paper - a refinement algorithm for community 
detection in stochastic block model motivated by penalized local maximum likelihood estimation. 

Indeed, for any SBM in the parameter space $\Theta_0(n,k,a,b,1)$ with equal community size, the
MLE for $\sigma$ [13, 18, 74] is

$$\hat\sigma = \arg\max_{\sigma : [n] \to [k]} \sum_{u < v} A_{uv} 1_{\{\sigma(u) = \sigma(v)\}}, \qquad (6)$$

which is a combinatorial optimization problem and hence is computationally intractable. However,
node-wise optimization of (6) has a simple closed form solution. Suppose the values of $\{\sigma(u)\}_{u=2}^n$
are known and we want to estimate $\sigma(1)$. Then, (6) reduces to

$$\hat\sigma(1) = \arg\max_{i \in [k]} \sum_{\{v \ne 1 : \sigma(v) = i\}} A_{1v}. \qquad (7)$$

For each $i \in [k]$, the quantity $\sum_{\{v \ne 1 : \sigma(v) = i\}} A_{1v}$ is exactly the number of neighbors that the first
node has in the $i$th community. Therefore, the most likely label for the first node is the one it has
the most connections with when all communities are of equal sizes. In practice, we do not know
any label in advance. However, we may estimate the labels of all but the first node by first applying
a community detection algorithm $\sigma^0$ on the subnetwork excluding the first node and its associated
edges, the adjacency matrix of which is denoted by $A_{-1}$ since it is the $(n-1) \times (n-1)$ submatrix
of $A$ with its first row and first column removed. Once we estimate the remaining labels, we can
apply (7) to estimate $\sigma(1)$ but with $\{\sigma(u)\}_{u=2}^n$ replaced with the estimated labels.
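The node-wise rule in (7) amounts to a plain neighbor vote, which the following sketch implements for an arbitrary node $u$. The function name, the 0-based labels, and the tie-breaking rule (smallest label wins) are illustrative assumptions.

```python
def neighbor_vote(A, u, labels, k):
    """Return the community in which node u has the most neighbors, as in (7).

    labels[v] is a (known or estimated) label in range(k) for every v != u;
    labels[u] is ignored. Ties are broken toward the smallest label.
    """
    counts = [0] * k
    for v, a_uv in enumerate(A[u]):
        if v != u and a_uv:
            counts[labels[v]] += 1
    return max(range(k), key=lambda i: counts[i])
```

The vote costs one pass over row $u$ of the adjacency matrix, which is why the node-wise problem is cheap even though the joint problem (6) is not.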

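For any $u \in [n]$, let $A_{-u}$ denote the $(n-1) \times (n-1)$ submatrix of $A$ with its $u$th row and $u$th
column removed. Given any community detection algorithm $\sigma^0$ which is able to cluster any graph
on $n - 1$ nodes into $k$ categories, we present the precise description of our refinement scheme in
Algorithm 1.

Algorithm 1: A refinement scheme for community detection

Input: Adjacency matrix $A \in \{0,1\}^{n \times n}$,
       number of communities $k$,
       initial community detection method $\sigma^0$.

Output: Community assignment $\hat\sigma$.

Penalized neighbor voting:
1 for $u = 1$ to $n$ do
2     Apply $\sigma^0$ on $A_{-u}$ to obtain $\sigma_u^0(v)$ for all $v \ne u$ and let $\sigma_u^0(u) = 0$;
3     Define $\tilde{\mathcal{C}}_i^u = \{v : \sigma_u^0(v) = i\}$ for all $i \in [k]$; let $\tilde{\mathcal{E}}_i^u$ be the set of edges within $\tilde{\mathcal{C}}_i^u$, and $\tilde{\mathcal{E}}_{ij}^u$
      the set of edges between $\tilde{\mathcal{C}}_i^u$ and $\tilde{\mathcal{C}}_j^u$ when $i \ne j$;
4     Define

      $$\hat{B}_{ii}^u = \frac{|\tilde{\mathcal{E}}_i^u|}{\frac{1}{2}|\tilde{\mathcal{C}}_i^u|(|\tilde{\mathcal{C}}_i^u| - 1)}, \qquad \hat{B}_{ij}^u = \frac{|\tilde{\mathcal{E}}_{ij}^u|}{|\tilde{\mathcal{C}}_i^u||\tilde{\mathcal{C}}_j^u|}, \qquad \forall i \ne j \in [k], \qquad (8)$$

      and let

      $$\hat{a}_u = n \min_{i \in [k]} \hat{B}_{ii}^u \quad \text{and} \quad \hat{b}_u = n \max_{i \ne j \in [k]} \hat{B}_{ij}^u. \qquad (9)$$

5     Define $\hat\sigma_u : [n] \to [k]$ by setting $\hat\sigma_u(v) = \sigma_u^0(v)$ for all $v \ne u$ and

      $$\hat\sigma_u(u) = \arg\max_{l \in [k]} \sum_{\{v : \sigma_u^0(v) = l\}} A_{uv} - \rho_u \sum_{v \in [n]} 1_{\{\sigma_u^0(v) = l\}}, \qquad (10)$$

      where for

      $$t_u = \frac{1}{2} \log \frac{\hat{a}_u (1 - \hat{b}_u/n)}{\hat{b}_u (1 - \hat{a}_u/n)}, \qquad (11)$$

      we define

      $$\rho_u = -\frac{1}{2 t_u} \log \left( \frac{\frac{\hat{a}_u}{n} e^{-t_u} + 1 - \frac{\hat{a}_u}{n}}{\frac{\hat{b}_u}{n} e^{t_u} + 1 - \frac{\hat{b}_u}{n}} \right). \qquad (12)$$

  end

Consensus:
6 Define $\hat\sigma(1) = \hat\sigma_1(1)$. For $u = 2, \dots, n$, define

      $$\hat\sigma(u) = \arg\max_{l \in [k]} |\{v : \hat\sigma_1(v) = l\} \cap \{v : \hat\sigma_u(v) = \hat\sigma_u(u)\}|. \qquad (13)$$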
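The plug-in estimates in (8) and (9) can be sketched as follows: with node $u$ held out, count edges within and between the estimated communities, normalize by the number of possible pairs, and scale by $n$. The function name, the 0-based labels, and the assumptions that $k \ge 2$ and that every estimated community has at least two nodes are illustrative choices made for this sketch.

```python
def estimate_a_b(A, u, labels, k):
    """Plug-in estimates from (8)-(9) with node u held out.

    labels[v] in range(k) for v != u; labels[u] is ignored.
    Assumes k >= 2 and every estimated community has at least two nodes.
    """
    n = len(A)
    sizes = [0] * k
    edges = [[0] * k for _ in range(k)]  # edges[i][j] with i <= j: edge counts
    for v in range(n):
        if v == u:
            continue
        sizes[labels[v]] += 1
        for w in range(v):
            if w != u and A[v][w]:
                i, j = sorted((labels[v], labels[w]))
                edges[i][j] += 1
    # Within-community density: edges over the number of unordered pairs.
    within = [edges[i][i] / (0.5 * sizes[i] * (sizes[i] - 1)) for i in range(k)]
    # Between-community density: edges over the number of cross pairs.
    between = [edges[i][j] / (sizes[i] * sizes[j])
               for i in range(k) for j in range(i + 1, k)]
    return n * min(within), n * max(between)
```

Taking the minimum of the within densities and the maximum of the between densities mirrors (9), so the estimates err on the side of a smaller gap between the two.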

The algorithm works in two consecutive steps. The first step carries out the foregoing heuristics
on a node by node basis. For each fixed node $u$, we first leave the node out and apply the available
community detection algorithm $\sigma^0$ on the remaining $n - 1$ nodes and the edges among them (as
summarized in the matrix $A_{-u} \in \{0,1\}^{(n-1) \times (n-1)}$) to obtain an initial community assignment
vector $\sigma_u^0$. For convenience, we make $\sigma_u^0$ an $n$-vector by fixing $\sigma_u^0(u) = 0$, though applying $\sigma^0$ on
$A_{-u}$ does not give any community assignment for $u$. We then assign the label of the $u$th node
according to (10), which is essentially (7) with $\sigma$ replaced with $\sigma_u^0$ except for the additional penalty
term. The additional penalty term is added to ensure the optimal performance even when both
the diagonal and the off-diagonal entries of the connectivity matrix $B$ are allowed to take different
values and the community sizes are not necessarily equal. To determine the penalty parameter $\rho_u$
in an adaptive way as spelled out in (11) - (12), we first estimate the connectivity matrix $B$ based
on $A_{-u}$ in (8) - (9). After we obtain the community assignment for $u$, we organize the assignment
for all $n$ vertices into an $n$-vector $\hat\sigma_u$. We call this step “penalized neighbor voting” since the first
term on the RHS of (10) counts the number of neighbors of $u$ in each (estimated) community while
the second term is a penalty term proportional to the size of each (estimated) community.
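The penalized vote can be sketched in a few lines: compute $t_u$ and $\rho_u$ from estimates of $a$ and $b$ as in (11) and (12), then score each community by its neighbor count minus $\rho_u$ times its estimated size as in (10). The function names and 0-based labels are illustrative assumptions, and the estimates are assumed to satisfy $0 < \hat b_u < \hat a_u < n$.

```python
import math

def penalty(a_hat, b_hat, n):
    """Compute t_u from (11) and rho_u from (12), given n > a_hat > b_hat > 0."""
    p, q = a_hat / n, b_hat / n
    t = 0.5 * math.log(p * (1 - q) / (q * (1 - p)))
    rho = -math.log((p * math.exp(-t) + 1 - p) / (q * math.exp(t) + 1 - q)) / (2 * t)
    return t, rho

def penalized_vote(A, u, labels, k, rho):
    """Apply (10): neighbor count minus rho times the estimated community size."""
    neighbors = [0] * k
    sizes = [0] * k
    for v, a_uv in enumerate(A[u]):
        if v == u:
            continue  # the held-out node belongs to no estimated community
        sizes[labels[v]] += 1
        neighbors[labels[v]] += a_uv
    return max(range(k), key=lambda i: neighbors[i] - rho * sizes[i])
```

With $t_u$ chosen as in (11), a short calculation shows that the ratio inside the logarithm of (12) equals $(1 - \hat a_u/n)/(1 - \hat b_u/n)$, so $\rho_u$ lies strictly between $\hat b_u/n$ and $\hat a_u/n$. When one estimated community is much larger than another, the penalty offsets the extra neighbors that the larger community collects by size alone.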

Once we complete the above procedure for each of the $n$ nodes, we obtain $n$ vectors $\hat\sigma_u \in [k]^n$,
$u = 1, \dots, n$, and turn to the second step of the algorithm. The basic idea behind the second step
is to obtain a unified community assignment by assembling $\{\hat\sigma_u(u) : u \in [n]\}$ and the immediate
hurdle is that each $\hat\sigma_u$ is only determined up to a permutation of the community labels. Thus, the
second step aims to find the right permutations by (13) before we assemble the $\hat\sigma_u(u)$'s. We call
this step “consensus” since we are essentially looking for a consensus on the community labels for
$n$ possibly different community assignments, under the assumption that all of them are close to the
ground truth up to some permutation.
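The consensus rule (13) can be sketched as follows: the first label vector serves as the reference, and each node $u$ receives the reference label that overlaps most with the community $u$ was placed in by its own vector. The function name and the 0-based indexing are illustrative assumptions.

```python
def consensus(assignments, k):
    """Combine n label vectors via (13), using the first vector as reference.

    assignments[u] is the full length-n label vector produced when node u was
    held out; labels are integers in range(k), determined only up to permutation.
    """
    n = len(assignments)
    reference = assignments[0]
    final = [reference[0]]
    for u in range(1, n):
        sigma_u = assignments[u]
        overlap = [0] * k
        for v in range(n):
            if sigma_u[v] == sigma_u[u]:    # v shares u's community under sigma_u
                overlap[reference[v]] += 1  # count it under the reference label
        final.append(max(range(k), key=lambda l: overlap[l]))
    return final
```

In the test case below, two of the four vectors use swapped label names, and the consensus step still returns a single consistent labeling.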

2.3 Initialization via spectral methods 

In this section, we present algorithms that can be used as initializers in Algorithm 1. Note that
for any model in (3), the matrix $P$ has rank at most $k$ and $\mathbb{E} A_{uv} = P_{uv}$ for all $u \ne v$. We
may first reduce the dimension of the data and then apply some clustering algorithm. Such an
approach is usually referred to as spectral clustering [65]. Technically speaking, spectral clustering
refers to the general method of clustering eigenvectors of some data matrix. For random graphs,
two commonly used methods are called unnormalized spectral clustering (USC) and normalized
spectral clustering (NSC). The former refers to clustering the eigenvectors of the adjacency matrix
$A$ itself and the latter refers to clustering the eigenvectors of the associated graph Laplacian $L(A)$.
To formally define the graph Laplacian, we introduce the notation $d_u = \sum_{v \in [n]} A_{uv}$ for the degree
of the $u$th node. The graph Laplacian operator $L : A \mapsto L(A)$ is defined by $L(A) = ([L(A)]_{uv})$
where $[L(A)]_{uv} = d_u^{-1/2} d_v^{-1/2} A_{uv}$. Although there have been debates and studies on which one
works better (see, for example, [66, 61]), for our purpose, both of them can lead to sufficiently
decent initial estimators.

The performances of USC and NSC depend critically on the bounds $\|A - P\|_{op}$ and $\|L(A) - L(P)\|_{op}$,
respectively. However, as pointed out by [19, 42], the matrices $A$ and $L(A)$ are not
good estimators of $P$ and $L(P)$ under the operator norm when the graph is sparse in the sense
that $\max_{u,v \in [n]} P_{uv} = o(\log n / n)$. Thus, regularizing $A$ and $L(A)$ is necessary to achieve better
performances for USC and NSC. The adjacency matrix $A$ can be regularized by trimming those
nodes with high degrees. Define the trimming operator $T_\tau : A \mapsto T_\tau(A)$ by replacing the $u$th row and
the $u$th column of $A$ with 0 whenever $d_u \ge \tau$, and so $T_\tau(A)$ and $A$ are of the same dimensions. It is
argued in [19] that by removing those high-degree nodes, $T_\tau(A)$ has better convergence properties.
The regularization method for the graph Laplacian goes back to [7] and its theoretical properties have
been studied by [39, 42]. In particular, Amini et al. [7] proposed to use $L(A_\tau)$ for NSC where
$A_\tau = A + \frac{\tau}{n} \mathbf{1}\mathbf{1}^T$ and $\mathbf{1} = (1, 1, \dots, 1)^T \in \mathbb{R}^n$. From now on, we use USC($\tau$) and NSC($\tau$) to denote
unnormalized spectral clustering and normalized spectral clustering with regularization parameter
$\tau$, respectively. Note that the unregularized USC is USC($\infty$) and the unregularized NSC is NSC(0).
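The three matrix operations used by USC($\tau$) and NSC($\tau$) are simple enough to sketch with plain lists: trimming high-degree nodes, adding the constant $\tau/n$ to every entry, and forming the normalized matrix $L(A)$. The function names are illustrative assumptions, the trimming threshold is taken as $d_u \ge \tau$, and the eigenvector computation that would follow is left to a linear algebra library.

```python
import math

def degrees(A):
    """Row sums of A, i.e. the node degrees d_u."""
    return [sum(row) for row in A]

def trim(A, tau):
    """T_tau(A): zero out row and column u whenever the degree d_u >= tau."""
    n = len(A)
    d = degrees(A)
    keep = [d[u] < tau for u in range(n)]
    return [[A[u][v] if keep[u] and keep[v] else 0 for v in range(n)]
            for u in range(n)]

def regularize(A, tau):
    """A_tau = A + (tau / n) * 1 1^T, the perturbation applied before NSC."""
    n = len(A)
    return [[A[u][v] + tau / n for v in range(n)] for u in range(n)]

def laplacian(A):
    """[L(A)]_uv = d_u^{-1/2} d_v^{-1/2} A_uv; assumes every degree is positive."""
    d = degrees(A)
    n = len(A)
    return [[A[u][v] / math.sqrt(d[u] * d[v]) for v in range(n)] for u in range(n)]
```

Setting `tau = float("inf")` in `trim` leaves the matrix unchanged, and `regularize(A, 0)` adds nothing to `A`, matching the conventions USC($\infty$) and NSC(0) for the unregularized methods. Note that `laplacian` divides by the degrees, so applying it after `regularize` with a positive `tau` avoids division by zero for isolated nodes.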




Another important issue in spectral clustering lies in the subsequent clustering method used
to cluster the eigenvectors. A popular choice is $k$-means clustering. However, finding the global
solution to the $k$-means problem is NP-hard [4, 48]. Kumar et al. [41] proposed a polynomial time
algorithm for achieving a $(1 + \epsilon)$ approximation to the $k$-means problem for any fixed $k$, which was
utilized in [45] to establish consistency for spectral clustering under the stochastic block model with
a fixed number of communities. However, a closer look at the complexity bound suggests that the
smallest possible $\epsilon$ is proportional to $k$. Thus, applying the algorithm and the associated bound in
[41] directly in our settings can lead to inferior error bounds when $k \to \infty$ as $n \to \infty$. To address
this issue under the stochastic block model, we propose a greedy clustering algorithm in Algorithm 2
inspired by the fact that the clustering centers under the stochastic block model are well separated from
each other on the population level. It is straightforward to check that the complexity of Algorithm
2 is polynomial in $n$.


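Algorithm 2: A greedy method for clustering

Input: Data matrix $\hat{U} \in \mathbb{R}^{n \times k}$, either the leading eigenvectors of $T_\tau(A)$ or those of $L(A_\tau)$,
       number of communities $k$,
       critical radius $r$ = […] with some constant $\mu > 0$.

Output: Community assignment $\hat\sigma$.

1 Set $S = [n]$;
2 for $i = 1$ to $k$ do
3     Let $t_i = \arg\max_{u \in S} |\{v \in S : \|\hat{U}_{v*} - \hat{U}_{u*}\| < r\}|$;
4     Set $\hat{\mathcal{C}}_i = \{v \in S : \|\hat{U}_{v*} - \hat{U}_{t_i *}\| < r\}$;
5     Label $\hat\sigma(u) = i$ for all $u \in \hat{\mathcal{C}}_i$;
6     Update $S \leftarrow S \setminus \hat{\mathcal{C}}_i$.
  end
7 If $S \ne \emptyset$, then for any $u \in S$, set $\hat\sigma(u) = \arg\min_{i \in [k]} \frac{1}{|\hat{\mathcal{C}}_i|} \sum_{v \in \hat{\mathcal{C}}_i} \|\hat{U}_{u*} - \hat{U}_{v*}\|$.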


Last but not least, we would like to emphasize that one need not limit the initialization algorithm
to the spectral methods introduced in this section. As Theorem 2 below shows, Algorithm 1
works for any initialization method that satisfies a weak consistency condition.
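The greedy clustering step of Algorithm 2 can be sketched as follows, taking the rows of the eigenvector matrix and the critical radius `r` as inputs. The function name, the 0-based labels, the tie-breaking toward the smallest index, and the reading of the final step as "smallest average distance to a cluster" are assumptions of this sketch; the choice of `r` is left to the caller.

```python
import math

def greedy_cluster(U, k, r):
    """Greedy clustering of the rows of U following the steps of Algorithm 2.

    U is a list of n rows (each a list of floats); r > 0 is the critical radius.
    Returns 0-based labels. Cost is O(k n^2) distance evaluations.
    """
    n = len(U)
    dist = lambda u, v: math.dist(U[u], U[v])
    remaining = set(range(n))
    labels = [None] * n
    clusters = []
    for i in range(k):
        if not remaining:
            clusters.append([])
            continue
        # Pick the remaining point whose r-ball contains the most remaining points.
        center = max(sorted(remaining),
                     key=lambda u: sum(1 for v in remaining if dist(u, v) < r))
        members = [v for v in sorted(remaining) if dist(center, v) < r]
        for v in members:
            labels[v] = i
        clusters.append(members)
        remaining -= set(members)
    # Leftover points join the cluster with the smallest average distance.
    for u in sorted(remaining):
        nonempty = [i for i in range(k) if clusters[i]]
        labels[u] = min(nonempty,
                        key=lambda i: sum(dist(u, v) for v in clusters[i]) / len(clusters[i]))
    return labels
```

Each of the $k$ rounds evaluates at most $n^2$ pairwise distances, which is consistent with the polynomial complexity noted in the text.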

3 Theoretical properties 

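Before stating the theoretical properties of the proposed method, we first review the minimax rate
in [74], which will be used as the optimality benchmark. The minimax risk is governed by the
following critical quantity,

$$I^* = -2 \log \left( \sqrt{\frac{a}{n} \cdot \frac{b}{n}} + \sqrt{\left(1 - \frac{a}{n}\right)\left(1 - \frac{b}{n}\right)} \right), \qquad (14)$$

which is the Renyi divergence of order $\frac{1}{2}$ between $\mathrm{Bern}(\frac{a}{n})$ and $\mathrm{Bern}(\frac{b}{n})$, i.e., Bernoulli distributions
with success probabilities $\frac{a}{n}$ and $\frac{b}{n}$ respectively. Recall that $0 < \frac{b}{n} < \frac{a}{n} \le 1 - \epsilon$ is assumed throughout
the paper. It can be shown that $I^* \asymp \frac{(a-b)^2}{na}$. Moreover, when $\frac{a}{n} = o(1)$,

$$I^* = (1 + o(1)) \frac{(\sqrt{a} - \sqrt{b})^2}{n} = (2 + o(1)) H^2\left(\mathrm{Bern}\left(\tfrac{a}{n}\right), \mathrm{Bern}\left(\tfrac{b}{n}\right)\right),$$

where $H^2(P, Q) = \frac{1}{2} \int (\sqrt{dP} - \sqrt{dQ})^2$ is the squared Hellinger distance between two distributions
$P$ and $Q$. The minimax rate for the parameter spaces (2) and (3) under the loss function (5) is
given in the following theorem.

Theorem 1 ([74]). When $\frac{(a-b)^2}{ak \log k} \to \infty$, we have

$$\inf_{\hat\sigma} \sup_{(B,\sigma) \in \Theta} \mathbb{E}_{B,\sigma}\, \ell(\hat\sigma, \sigma) = \begin{cases} \exp\left(-(1 + \eta) \frac{nI^*}{2}\right), & k = 2; \\ \exp\left(-(1 + \eta) \frac{nI^*}{\beta k}\right), & k \ge 3, \end{cases}$$

for both $\Theta = \Theta_0(n,k,a,b,\beta)$ and $\Theta = \Theta(n,k,a,b,\lambda,\beta;\alpha)$ with any $\lambda \le$ […] and any $\beta \in [1, \sqrt{5/3})$,
where $\eta = \eta_n \to 0$ is some sequence tending to 0 as $n \to \infty$.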


Remark 1. The assumption $\beta \in [1, \sqrt{5/3})$ is needed in [74] for some technical reason. Here, the
parameter $\beta$ enters the minimax rates when $k \ge 3$ since the worst case is essentially when one has
two communities of size $\frac{n}{\beta k}$, while for $k = 2$, the worst case is essentially two communities of size
$\frac{n}{2}$. For all other results in this paper, we allow $\beta$ to be an arbitrary constant no less than 1.
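As a numerical illustration of the critical quantity, the sketch below evaluates the Renyi divergence of order 1/2 between two Bernoulli distributions, together with the squared Hellinger distance, and compares them in a sparse regime. The function names and the values of $n$, $a$ and $b$ are illustrative assumptions.

```python
import math

def renyi_half(p, q):
    """Renyi divergence of order 1/2 between Bern(p) and Bern(q)."""
    return -2.0 * math.log(math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q)))

def hellinger_sq(p, q):
    """Squared Hellinger distance H^2 = (1/2) * sum (sqrt(dP) - sqrt(dQ))^2."""
    return 0.5 * ((math.sqrt(p) - math.sqrt(q)) ** 2
                  + (math.sqrt(1 - p) - math.sqrt(1 - q)) ** 2)

n, a, b = 10_000, 20.0, 5.0  # illustrative sparse regime: a/n = 0.002
I_star = renyi_half(a / n, b / n)
approx = (math.sqrt(a) - math.sqrt(b)) ** 2 / n
```

For these values `I_star`, `approx` and `2 * hellinger_sq(a / n, b / n)` agree to within about one percent, in line with the approximations that hold when $a/n = o(1)$.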

To this end, let us show that the two-stage algorithm proposed in Section 2 achieves the optimal
misclassification proportion. The essence of the two-stage algorithm lies in the refinement scheme
described in Algorithm 1. As long as any initialization step satisfies a certain weak consistency
criterion, the refinement step directly leads to a solution with optimal misclassification proportion.
To be specific, the initialization step needs to satisfy the following condition.

Condition 1. There exist constants $C_0, \delta > 0$ and a positive sequence $\gamma = \gamma_n$ such that

$$\inf_{(B,\sigma) \in \Theta} \min_{u \in [n]} \mathbb{P}_{B,\sigma}\left\{ \ell(\sigma, \sigma_u^0) \le \gamma \right\} \ge 1 - C_0 n^{-(1+\delta)}, \qquad (15)$$

for some parameter space $\Theta$.


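Under Condition 1, we have the following upper bounds regarding the performance of the
proposed refinement scheme.

Theorem 2. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak \log k} \to \infty$, $a \asymp b$ and Condition 1 is satisfied for

$$\gamma = o\left(\frac{1}{k \log k}\right) \qquad (16)$$

and $\Theta = \Theta_0(n,k,a,b,\beta)$. Then there is a sequence $\eta \to 0$ such that

$$\begin{aligned}
\sup_{(B,\sigma) \in \Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\hat\sigma, \sigma) > \exp\left(-(1 - \eta) \frac{nI^*}{2}\right) \right\} \to 0, & \quad \text{if } k = 2, \\
\sup_{(B,\sigma) \in \Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\hat\sigma, \sigma) > \exp\left(-(1 - \eta) \frac{nI^*}{\beta k}\right) \right\} \to 0, & \quad \text{if } k \ge 3,
\end{aligned} \qquad (17)$$

where $I^*$ is defined as in (14).

If in addition Condition 1 is satisfied for $\gamma$ satisfying both (16) and

$$\gamma = o\left(\frac{a - b}{ak}\right) \qquad (18)$$

and $\Theta = \Theta(n,k,a,b,\lambda,\beta;\alpha)$, then the conclusion in (17) continues to hold for $\Theta = \Theta(n,k,a,b,\lambda,\beta;\alpha)$.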


Theorem 2 assumes $a \asymp b$. The case when $a \asymp b$ may not hold is considered in Section 6.
Compared with Theorem 1, the upper bounds (17) achieved by Algorithm 1 are minimax optimal.
The condition (16) for the parameter space $\Theta_0(n,k,a,b,\beta)$ is very mild. When $k = O(1)$, it reduces
to $\gamma = o(1)$ and simply means that the initialization should be weakly consistent at any rate. For
$k \to \infty$, it implies that the misclassification proportion within each community converges to zero.
Note that if the initialization step gives wrong labels to all nodes in one particular community,
then the misclassification proportion is at least $1/k$. The condition (16) rules out this situation.
For the parameter space $\Theta(n,k,a,b,\lambda,\beta;\alpha)$, an extra condition (18) is required. This is because
estimating the connectivity matrix $B$ in $\Theta(n,k,a,b,\lambda,\beta;\alpha)$ is harder than in $\Theta_0(n,k,a,b,\beta)$. In
other words, if we do not pursue adaptive estimation, (18) is not needed.

Remark 2. Theorem 2 is an adaptive result without assuming the knowledge of $a$ and $b$. When
these two parameters are known, we can directly use $a$ and $b$ in (11) of Algorithm 1. By scrutinizing
the proof of Theorem 2, the conditions (16) and (18) can be weakened to $\gamma = o(k^{-1})$ in this case.

Given the results of Theorem 2, it remains to check that the initialization step via spectral
clustering satisfies Condition 1. For matrix $P = (P_{uv}) = (B_{\sigma(u)\sigma(v)})$ with $(B, \sigma)$ belonging to either
$\Theta_0(n,k,a,b,\beta)$ or $\Theta(n,k,a,b,\lambda,\beta;\alpha)$, we use $\lambda_k$ to denote $\lambda_k(P)$. Define the average degree by

$$\bar{d} = \frac{1}{n} \sum_{u \in [n]} d_u. \qquad (19)$$

Theorem 3. Assume $e \le a \le C_1 b$ for some constant $C_1 > 0$ and

$$\frac{ka}{\lambda_k^2} \le c, \qquad (20)$$

for some sufficiently small $c \in (0,1)$. Consider USC($\tau$) with a sufficiently small constant $\mu > 0$ in
Algorithm 2 and $\tau = C_2 \bar{d}$ for some sufficiently large constant $C_2 > 0$. For any constant $C' > 0$,
there exists some $C > 0$ only depending on $C'$, $C_1$, $C_2$ and $\mu$ such that

$$\ell(\hat\sigma, \sigma) \le C \frac{a}{\lambda_k^2}$$

with probability at least $1 - n^{-C'}$. If $k$ is fixed, the same conclusion holds without assuming $a \le C_1 b$.


Remark 3. Theorem 3 improves the error bound for spectral clustering in [45]. While [45] requires
the assumption $a > C \log n$, our result also holds for $a = o(\log n)$. A result close to ours is that by
[19], but their clustering step is different from Algorithm 2. Moreover, the conclusion of Theorem 3
holds with probability $1 - n^{-C'}$ for an arbitrarily large $C'$, which is critical because the initialization
step needs to satisfy Condition 1 for the subsequent refinement step to work. On the other hand,
the bound in [19] is stated with probability $1 - o(1)$.

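When $k = O(1)$, Theorem 2 and Theorem 3 jointly imply the following result.

Corollary 3.1. Consider Algorithm 1 initialized by $\sigma^0$ with USC($\tau$) for $\tau = C \bar d$, where $C$ is a
sufficiently large constant. Suppose as $n \to \infty$, $k = O(1)$, $\frac{(a-b)^2}{a} \to \infty$ and $a \asymp b$. Then, there
exists a sequence $\eta \to 0$ such that

$$\sup_{(B,\sigma) \in \Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\hat\sigma, \sigma) \ge \exp\left( -(1-\eta) \frac{n I^*}{2} \right) \right\} \to 0, \quad \text{if } k = 2,$$

$$\sup_{(B,\sigma) \in \Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\hat\sigma, \sigma) \ge \exp\left( -(1-\eta) \frac{n I^*}{\beta k} \right) \right\} \to 0, \quad \text{if } k \ge 3,$$

where the parameter space is $\Theta = \Theta_0(n, k, a, b, \beta)$.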

Compared with Theorem 1, the proposed procedure achieves the minimax rate under the conditions
$\frac{(a-b)^2}{a} \to \infty$ and $a \asymp b$. When $k = O(1)$, the condition $\frac{(a-b)^2}{a} \to \infty$ is necessary and sufficient
for weak consistency in view of Theorem 1. More general results, including the case of $k \to \infty$, are
stated and discussed in Section 6.

The following theorem characterizes the misclassification rate of normalized spectral clustering. 


Theorem 4. Assume $e \le a \le C_1 b$ for some constant $C_1 > 0$ and

$$\frac{ka \log a}{\lambda_k^2} \le c, \qquad (21)$$

for some sufficiently small $c \in (0,1)$. Consider NSC($\tau$) with a sufficiently small constant $\mu > 0$
in Algorithm 2 and $\tau = C_2 \bar d$ for some sufficiently large constant $C_2 > 0$. Then, for any constant
$C' > 0$, there exists some $C > 0$ only depending on $C', C_1, C_2$ and $\mu$ such that

$$\ell(\hat\sigma, \sigma) \le C \frac{a \log a}{\lambda_k^2}$$

with probability at least $1 - n^{-C'}$. If $k$ is fixed, the same conclusion holds without assuming $a \le C_1 b$.


Remark 4. A slightly different regularization of normalized spectral clustering is studied by [57]
only for the dense regime, while Theorem 4 holds under both dense and sparse regimes. Moreover,
our result also improves that of [42] due to our tighter bound on $\|L(A_\tau) - L(P_\tau)\|_{\rm op}$ in Lemma 7
below. We conjecture that the $\log a$ factor in both the assumption and the bound of Theorem 4
can be removed.

Note that Theorem 3 and Theorem 4 are stated in terms of the quantity $\lambda_k$. We may specialize
the results into the parameter spaces defined in (2) and (3). By Proposition 1, $\lambda_k \ge \frac{a-b}{2\beta k}$ for
$\Theta_0(n, k, a, b, \beta)$ and $\lambda_k \ge \lambda$ for $\Theta(n, k, a, b, \lambda, \beta; \alpha)$. The implications of Theorem 3 and Theorem 4
and their uses as the initialization step for Algorithm 1 are discussed in full detail in Section 6.


4 Numerical results 

In this section we present the performance of the proposed algorithm on simulated datasets. The
experiments cover three different scenarios: (1) dense network with communities of equal sizes; (2)
dense network with communities of unequal sizes; and (3) sparse network. Recall the definition of $\bar d$
in (19). For each setting, we report results of Algorithm 1 initialized with four different approaches:
USC($\infty$), USC($2\bar d$), NSC(0) and NSC($\bar d$), the descriptions of which can all be found in Section 2.3.
For all these spectral clustering methods, Algorithm 2 was used to cluster the leading eigenvectors.
The constant $\mu$ in the critical radius definition was set to 0.5 in all the results reported here.
For each setting, the results are based on 100 independent draws from the underlying stochastic
block model.

To achieve faster running time, we also ran a simplified version of Algorithm 1. Instead of
obtaining $n$ different initializers $\{\sigma_u^0\}_{u \in [n]}$ to refine each node separately, the simplified algorithm
refines all the nodes with a single initialization on the whole network. Thus, the running time can
be reduced roughly by a factor of $n$. Simulation results below suggest that the simplified version
achieves performance similar to that of Algorithm 1 in all the settings we have considered. For the
precise description of the simplified algorithm, we refer readers to Algorithm 3 in the appendix.


Balanced case. In this setting, we generate networks with 2500 nodes and 10 communities, each
of which consists of 250 nodes, and we set $B_{ii} = 0.48$ for all $i$ and $B_{ij} = 0.32$ for all $i \ne j$. Figure 1
shows the boxplots of the number of misclassified nodes. The first four boxplots correspond to the
four different spectral clustering methods, in the order of USC($\infty$), USC($2\bar d$), NSC(0) and NSC($\bar d$).
The middle four correspond to the results achieved by applying the simplified refinement scheme
with these four initialization methods, and the last four show the results of Algorithm 1 with these
four initialization methods. Regardless of the initialization method, Algorithm 1 or its simplified
version reduces the number of misclassified nodes from around 30 to around 5.
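The balanced design can be simulated directly from its description. The following is a minimal sketch in plain Python, not the code used for the reported experiments; the helper names `balanced_connectivity` and `sample_sbm` are illustrative assumptions, and the example call uses a scaled-down network (4 communities of 20 nodes) so that the quadratic-time loop stays fast.

```python
import random


def balanced_connectivity(k, p_in, p_out):
    """k x k connectivity matrix with p_in on the diagonal and p_out elsewhere."""
    return [[p_in if i == j else p_out for j in range(k)] for i in range(k)]


def sample_sbm(sizes, B, seed=0):
    """Draw one symmetric 0/1 adjacency matrix (no self-loops) from an SBM.

    sizes[i] is the number of nodes in community i and B[i][j] is the
    edge probability between communities i and j.
    """
    rng = random.Random(seed)
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < B[labels[u]][labels[v]]:
                adj[u][v] = adj[v][u] = 1
    return adj, labels


# Scaled-down stand-in for the balanced setting (the paper uses 10 x 250 nodes).
adj, labels = sample_sbm([20] * 4, balanced_connectivity(4, 0.48, 0.32), seed=1)
```

Calling `sample_sbm([250] * 10, balanced_connectivity(10, 0.48, 0.32))` would reproduce the stated dimensions, at the cost of a few million Bernoulli draws.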



[Figure 1: twelve boxplots, left to right: USC($\infty$), USC($2\bar d$), NSC(0), NSC($\bar d$); Refine (Simple) with each of the four initializers; Refine with each of the four initializers.]

Figure 1: Boxplots of number of misclassified nodes: Balanced case. Simple indicates that the
simplified version of Algorithm 1 is used instead.


Imbalanced case. In this setting, we generate networks with 2000 nodes and 4 communities, the
sizes of which are 200, 400, 600 and 800, respectively. The connectivity matrix is

$$B = \begin{pmatrix} 0.50 & 0.29 & 0.35 & 0.25 \\ 0.29 & 0.45 & 0.25 & 0.30 \\ 0.35 & 0.25 & 0.50 & 0.35 \\ 0.25 & 0.30 & 0.35 & 0.45 \end{pmatrix}.$$

Hence, the within-community edge probability is no smaller than 0.45 while the between-community
edge probability is no greater than 0.35, and the underlying SBM is inhomogeneous. Figure 2 shows
the boxplots of the number of misclassified nodes obtained by different initialization methods and
their refinements, and the boxplots are presented in the same order as those in Figure 1. Similarly,
we can see that refinement significantly reduces the error.


[Figure 2: twelve boxplots, left to right: USC($\infty$), USC($2\bar d$), NSC(0), NSC($\bar d$); Refine (Simple) with each of the four initializers; Refine with each of the four initializers.]

Figure 2: Boxplots of number of misclassified nodes: Imbalanced case. Simple indicates that the
simplified version of Algorithm 1 is used instead.

Sparse case. In this setting we consider a much sparser stochastic block model than the previous
two cases. In particular, each simulated network has 4000 nodes, divided into 10 communities all
of size 400. We set all $B_{ii} = 0.032$ and all $B_{ij} = 0.005$ when $i \ne j$. The average degree of each
node in the network is around 30. Figure 3 shows the boxplots of the number of misclassified nodes
obtained by different initialization methods and their refinements, and the boxplots are presented
in the same order as those in Figure 1. Compared with either USC or NSC initialization, refinement
reduces the number of misclassified nodes by 50%.
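The stated average degree can be checked from the model parameters alone. The sketch below computes the expected average degree of an SBM with given community sizes and connectivity matrix; the function name `expected_average_degree` is an illustrative assumption, and self-loops are excluded.

```python
def expected_average_degree(sizes, B):
    """Expected average degree of an SBM without self-loops."""
    n = sum(sizes)
    total = 0.0
    for i, n_i in enumerate(sizes):
        # Expected degree of one node in community i: it has n_i - 1
        # potential neighbours in its own community and n_j in community j.
        degree_i = sum(
            B[i][j] * (n_j - (1 if i == j else 0)) for j, n_j in enumerate(sizes)
        )
        total += n_i * degree_i
    return total / n


k = 10
B_sparse = [[0.032 if i == j else 0.005 for j in range(k)] for i in range(k)]
d_sparse = expected_average_degree([400] * k, B_sparse)
# 399 * 0.032 + 3600 * 0.005 = 30.768
```

The value of about 30.8 agrees with the "around 30" quoted for the sparse setting.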



[Figure 3: twelve boxplots, left to right: USC($\infty$), USC($2\bar d$), NSC(0), NSC($\bar d$); Refine (Simple) with each of the four initializers; Refine with each of the four initializers.]

Figure 3: Boxplots of number of misclassified nodes: Sparse case. Simple indicates that the
simplified version of Algorithm 1 is used instead.


Summary. In all three simulation settings, for all four initialization approaches considered, the
refinement scheme in Algorithm 1 (and its simplified version) was able to significantly reduce the 
number of misclassified nodes, which is in agreement with the theoretical properties presented in 
Section 3. 

5 Real data example 

We now compare the results of our algorithm and some existing methods on a political blog dataset 
[3]. Each node in this network represents a blog about US politics and a pair of nodes is connected 
if one blog contains a link to the other. There were 1490 nodes to start with, each labeled liberal 
or conservative. In what follows, we consider only the 1222 nodes located in the largest connected 
component of the network. This pre-processing step is the same as what was done in [40]. After 
pre-processing, the network has 586 liberal blogs and 636 conservative ones which naturally form 
two communities. As shown in the right panel of Figure 4, nodes are more likely to be connected 
if they have the same political ideology. 

Table 1 summarizes the results of Algorithm 1 and its simplified version on this network with
four different initialization methods, as well as the performance of directly applying the four
methods to the dataset. The average degree of the network $\bar d$ is 27, which is used as the tuning
parameter for regularized NSC. For regularized USC, we set $\tau$ equal to twice the average degree,
leading to the removal of the 196 most connected nodes. The result of directly applying any of the four
spectral clustering based initializations was unsatisfactory, with at least 25% of the nodes misclassified
(308 of 1222 in the best case).
Despite the unsatisfactory performance of the initializers, Algorithm 1 and its simplified version
are able to significantly reduce the number of misclassified nodes except for the case of NSC(0),
and the performances of the two are close to each other regardless of the initialization method.

Figure 4: Connectivity of political blogs. Left panel: plot of the adjacency matrix when the nodes
are not grouped. Right panel: plot of the adjacency matrix when the nodes are grouped according
to political ideology.

No. of nodes misclassified (out of 1222), by initialization and refinement:

Initialization      NA     Algo 1   Simple
USC($\infty$)       383    116      115
USC($2\bar d$)      583    307      294
NSC(0)              579    585      581
NSC($\bar d$)       308    86       87

Table 1: Performance on the political blog dataset. “NA” stands for direct application of the
initialization method on the whole dataset; “Algo 1” stands for the application of Algorithm 1 with
$\sigma^0$ being the labeled initialization method; “Simple” stands for the application of the simplified
version of Algorithm 1 with $\sigma^0$ being the labeled initialization method.

An interesting observation is that if we apply the refinement scheme multiple times, the number
of misclassified nodes keeps decreasing until convergence, and the further reduction of misclassification
proportion compared to a single refinement can be sizable. Figure 5 plots the numbers of
misclassified nodes for multiple iterations of refinement via the simplified version of Algorithm 1.
We are able to achieve 61, 58 or 63 misclassified nodes out of 1222 depending on which initialization
method is used. For the three initialization methods included in the figure, the number of
misclassified nodes converges within several iterations. NSC with $\tau = 0$ is not included in Figure 5
due to the relatively inferior initialization, but its error also converges to around 60/1222 after 20
iterations. For comparison, a state-of-the-art method such as SCORE [37] was reported to achieve
a comparable error of 58/1222. It is worth noting that SCORE was designed under the setting of
the degree-corrected stochastic block model, which fits the current dataset better than SBM due to the
presence of hubs and low-degree nodes. The regularized spectral clustering implemented by [57],
which was also designed under the degree-corrected stochastic block model, was reported to have
an error of $(80 \pm 2)/1222$. The semi-definite programming method by [13] achieved 63/1222.
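Repeated refinement of this kind can be sketched as a loop of penalized neighbor votes. The code below is an illustration under stated assumptions, not Algorithm 3 itself (which is given in the appendix): every node is reassigned to the community maximizing its edge count into that community minus `rho` times the community size, and the pass is repeated until the labels stop changing. The function names are hypothetical, and the formulas for `t` and `rho` in `voting_penalty` are assumed forms modeled on the Chernoff argument of Section 7.1 rather than quotations of (11) and (12).

```python
import math


def voting_penalty(a, b, n):
    """Penalty rho for neighbor voting (assumed form; lies between b/n and a/n)."""
    p, q = a / n, b / n
    t = 0.5 * math.log(p * (1 - q) / (q * (1 - p)))
    ratio = (p * math.exp(-t) + 1 - p) / (q * math.exp(t) + 1 - q)
    return -math.log(ratio) / (2 * t)


def refine_once(adj, labels, k, rho):
    """One pass of penalized neighbor voting over all nodes."""
    n = len(adj)
    sizes = [labels.count(c) for c in range(k)]
    updated = []
    for u in range(n):
        votes = [0] * k
        for v in range(n):
            if v != u and adj[u][v]:
                votes[labels[v]] += 1

        def score(c):
            # Community size excluding node u itself.
            size = sizes[c] - (1 if labels[u] == c else 0)
            return votes[c] - rho * size

        updated.append(max(range(k), key=score))
    return updated


def refine_until_stable(adj, labels, k, rho, max_iter=50):
    """Apply refine_once repeatedly until the labels no longer change."""
    for _ in range(max_iter):
        updated = refine_once(adj, labels, k, rho)
        if updated == labels:
            break
        labels = updated
    return labels


# Toy example: two disjoint 4-cliques, with node 3 initially mislabeled.
toy_adj = [[1 if u != v and (u < 4) == (v < 4) else 0 for v in range(8)] for u in range(8)]
toy_init = [0, 0, 0, 1, 1, 1, 1, 1]
toy_result = refine_until_stable(toy_adj, toy_init, k=2, rho=0.4)
```

On the toy graph the single mislabeled node is corrected in the first pass and the second pass changes nothing, which mirrors the convergence behaviour described above on a much smaller scale.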

Figure 5: Number of misclassified nodes vs. number of times the refinement scheme is applied.

To summarize, our algorithm leads to significant performance improvement over several popular
spectral clustering based methods on the political blog dataset. With repeated refinements, it
demonstrates competitive performance even when compared with methods designed for models
that better fit the current dataset.


6 Discussion 

In this section, we discuss a few important issues related to the methodology and theory we have 
presented in the previous sections. 


6.1 Error bounds when $a \asymp b$ may not hold

In Section 3, we established upper bounds on misclassification proportion under the assumption of
$a \asymp b$. The following theorem shows that slightly weaker upper bounds can be obtained even when
$a \asymp b$ does not hold. To state the result, recall that we assume throughout the paper $\frac{a}{n} \le 1 - \epsilon$ for
some numeric constant $\epsilon \in (0,1)$.

Theorem 5. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak \log k} \to \infty$ and Condition 1 is satisfied for $\gamma$ satisfying (16)
and $\Theta = \Theta_0(n, k, a, b, \beta)$. Then for some positive constants $c_\epsilon$ and $C_\epsilon$ that depend only on $\epsilon$, for
any sufficiently small constant $\epsilon_0 \in (0, c_\epsilon)$, if we replace the definition of $t_u$'s in (11) with


( 1 a u ( 1 - bu/ri) 


A log 


eo/2’ 


( 22 ) 


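then we have

$$\sup_{(B,\sigma) \in \Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\hat\sigma, \sigma) \ge \exp\left( -(1 - C_\epsilon \epsilon_0) \frac{n I^*}{2} \right) \right\} \to 0, \quad \text{if } k = 2,$$

$$\sup_{(B,\sigma) \in \Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\hat\sigma, \sigma) \ge \exp\left( -(1 - C_\epsilon \epsilon_0) \frac{n I^*}{\beta k} \right) \right\} \to 0, \quad \text{if } k \ge 3, \qquad (23)$$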


where I* is defined as in (14). In particular, we can set C e = ^ m og £ 2 and c e = min(jQ^, 

If in addition Condition 1 is satisfied for $\gamma$ satisfying both (16) and (18) and $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$,
then the same conclusion holds for $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$.

Compared with the conclusion (17) in Theorem 2, the vanishing sequence $\eta$ in the exponent of the
upper bound is replaced by C t e o, which is guaranteed to be smaller than min(0.1, log( - 2 9/ ^ ) and can 
be driven to be arbitrarily small by decreasing eo- To achieve this, the t u 's used in defining the 
penalty parameters in the penalized neighbor voting step need to be truncated at the value log 


6.2 Implications of the results 

We now discuss some implications of the results in Theorems 2-5. 

When using USC as initialization for Algorithm 1, we obtain the following results by combining
Theorem 2, Theorem 3 and Theorem 5. Recall that $\bar d$ is the average degree of nodes in $A$ defined
in (19).

Theorem 6. Consider Algorithm 1 initialized by $\sigma^0$ with USC($\tau$) with $\tau = C \bar d$ for some sufficiently
large constant $C > 0$. If as $n \to \infty$, $a \asymp b$ and

$$\frac{(a-b)^2}{a k^3 \log k} \to \infty, \qquad (24)$$

then there is a sequence $\eta \to 0$ such that (17) holds with $\Theta = \Theta_0(n, k, a, b, \beta)$. If as $n \to \infty$, $a \asymp b$
and

$$\frac{\lambda^2}{ak\left(\log k + a/(a-b)\right)} \to \infty, \qquad (25)$$

then (17) holds for $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. If for either parameter space, $a \asymp b$ may not hold but
$k$ is fixed and (24) or (25) holds respectively, then (23) holds as long as $t_u$ is replaced by (22) in
Algorithm 1.


Compared with Theorem 1, the minimax optimal performance is achieved under mild conditions.
Take $\Theta = \Theta_0(n, k, a, b, \beta)$ for example. For any fixed $k$, the minimax optimal misclassification
proportion is achieved with high probability only under the additional condition of $a \asymp b$. In
addition, weak consistency is achieved for fixed $k$ as long as $\frac{(a-b)^2}{a} \to \infty$, regardless of the behavior
of $\frac{a}{b}$. This condition is indeed necessary and sufficient for weak consistency. See, for instance,
[51, 53, 73, 74]. To achieve strong consistency for fixed $k$, it suffices to ensure $\ell(\hat\sigma, \sigma) < \frac{1}{n}$, and
Theorem 6 implies that it is sufficient to have

$$\liminf_{n \to \infty} \frac{n I^*}{2 \log n} > 1, \text{ when } k = 2; \qquad \liminf_{n \to \infty} \frac{n I^*}{\beta k \log n} > 1, \text{ when } k \ge 3, \qquad (26)$$

regardless of the behavior of $\frac{a}{b}$. On the other hand, Theorem 1 shows that it is impossible to achieve
strong consistency if

$$\limsup_{n \to \infty} \frac{n I^*}{2 \log n} < 1, \text{ when } k = 2; \qquad \limsup_{n \to \infty} \frac{n I^*}{\beta k \log n} < 1, \text{ when } k \ge 3. \qquad (27)$$

When $\frac{a}{n} = o(1)$, $n I^* = (1 + o(1))(\sqrt{a} - \sqrt{b})^2$ and so one can replace $n I^*$ in (26)-(27) with
$(\sqrt{a} - \sqrt{b})^2$. In the literature, Abbe et al. [2], Mossel et al. [54] and Hajek et al. [31] obtained
comparable strong consistency results via efficient algorithms for the special case of two communities
of equal sizes, i.e., $k = 2$ and $\beta = 1$. Under the additional assumption of $a \asymp b \asymp \log n$, Hajek
et al. [32] later achieved the result via an efficient algorithm for the case of fixed $k$ and $\beta = 1$, and
Abbe and Sandon [1] investigated the case of fixed $k$ and $\beta \ge 1$. In comparison, our result holds
for any fixed $k$ and any $\beta \ge 1$ without assuming $a \asymp b \asymp \log n$. In the weak consistency regime, in
terms of misclassification proportion, for the special case of $k = 2$ and $\beta = 1$, Yun and Proutiere
[72] achieved the optimal rate for $\Theta_0(n, 2, a, b, 1)$ when $a \asymp b \asymp a - b$, while the error bounds in
other papers are typically off by a constant multiplier on the exponent. In comparison, Theorem 6
provides optimal results (17) and near optimal results (23) for a much broader class of models under
much weaker conditions. Last but not least, our algorithm can provably achieve strong consistency
and minimax optimal performance even for growing $k$, which, to our limited knowledge, is the first
in the literature.
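The thresholds in (26) and (27) are easy to evaluate numerically for given $n$, $a$, $b$, $k$ and $\beta$. The sketch below assumes that $I^*$ in (14) has the form $I^* = -2\log\left(\sqrt{\tfrac{a}{n}\tfrac{b}{n}} + \sqrt{(1-\tfrac{a}{n})(1-\tfrac{b}{n})}\right)$, which is the form consistent with the Chernoff bound used in Section 7.1; the function names are illustrative. A ratio above 1 corresponds to the sufficient condition (26) and a ratio below 1 to the impossibility condition (27), although both are asymptotic statements and a finite-$n$ value is only indicative.

```python
import math


def n_times_i_star(n, a, b):
    """n * I^* with I^* = -2 log( sqrt(pq) + sqrt((1-p)(1-q)) ), p = a/n, q = b/n."""
    p, q = a / n, b / n
    return -2.0 * n * math.log(math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q)))


def strong_consistency_ratio(n, a, b, k, beta=1.0):
    """Ratio compared with 1 in (26)-(27): nI*/(2 log n) for k = 2,
    and nI*/(beta k log n) for k >= 3."""
    denominator = 2.0 * math.log(n) if k == 2 else beta * k * math.log(n)
    return n_times_i_star(n, a, b) / denominator


# In the sparse regime a/n = o(1), n I^* is close to (sqrt(a) - sqrt(b))^2.
approx = (math.sqrt(100.0) - math.sqrt(25.0)) ** 2   # equals 25
exact = n_times_i_star(10**6, 100.0, 25.0)
```

For example, with $n = 10^6$, $a = 100$ and $b = 25$ the two quantities `approx` and `exact` nearly coincide, illustrating why $nI^*$ may be replaced by $(\sqrt{a} - \sqrt{b})^2$ when $a/n = o(1)$.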

The performance of Algorithm 1 initialized by NSC can be summarized as the following theorem 
by combining Theorem 2, Theorem 4 and Theorem 5. In this case, the sufficient condition for 
achieving minimax optimal performance is slightly stronger than when USC is used for initialization. 

Theorem 7. Consider Algorithm 1 initialized by $\sigma^0$ with NSC($\tau$) with $\tau = C \bar d$ for some sufficiently
large constant $C > 0$. If as $n \to \infty$, $a \asymp b$ and

$$\frac{(a-b)^2}{a k^3 \log k \log a} \to \infty, \qquad (28)$$

then there is a sequence $\eta \to 0$ such that (17) holds with $\Theta = \Theta_0(n, k, a, b, \beta)$. If as $n \to \infty$, $a \asymp b$
and

$$\frac{\lambda^2}{ak \log a \left(\log k + a/(a-b)\right)} \to \infty, \qquad (29)$$

then (17) holds for $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. If for either parameter space, $a \asymp b$ may not hold but
$k$ is fixed and (28) or (29) holds respectively, then (23) holds as long as $t_u$ is replaced by (22) in
Algorithm 1.


Last but not least, we would like to point out that when the key parameters a and b are known, 
we can obtain the desired performance guarantee under weaker conditions as summarized in the 
following theorem. 


Theorem 8 (The case of known $a$, $b$). Suppose $a$, $b$ are known. Consider Algorithm 1 initialized by
$\sigma^0$ with USC($\tau$) with $\tau = Ca$ for some sufficiently large constant $C > 0$ and $\hat a_u = a$, $\hat b_u = b$ in (9)
for all $u \in [n]$. If as $n \to \infty$, $a \asymp b$ and

$$\frac{(a-b)^2}{a k^3} \to \infty, \qquad (30)$$

then there is a sequence $\eta \to 0$ such that (17) holds with $\Theta = \Theta_0(n, k, a, b, \beta)$. If as $n \to \infty$, $a \asymp b$
and

$$\frac{\lambda^2}{ak} \to \infty, \qquad (31)$$

then (17) holds with $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. If for either parameter space, without assuming $a \asymp b$,
(30) or (31) holds respectively, then (23) holds if in addition $t_u$ is replaced by (22).

If instead NSC($\tau$) is used for initialization with $\tau = Ca$ for some sufficiently large constant $C > 0$,
then the above conclusions hold if we replace (30) with $\frac{(a-b)^2}{a k^3 \log a} \to \infty$ and (31) with $\frac{\lambda^2}{ak \log a} \to \infty$,
respectively.
6.3 Potential future research problems

Simplified version of Algorithm 1 and iterative refinement. In simulation studies, we experimented
with a simplified version of Algorithm 1 (with precise description as Algorithm 3 in the appendix)
and showed that it provided similar performance to Algorithm 1 on simulated datasets. Moreover,
for the political blog data, we showed that iterative application of this simplified refinement scheme
kept driving down the number of misclassified nodes until convergence. It is of great interest to see
if theoretical results comparable to Theorem 2 could be established for the simplified and/or the
iterative version, and if the iterative version converges to a local optimum of a certain objective
function for community detection. Though answering these intriguing questions is beyond the scope of
the current paper, we think it can serve as an interesting future research problem.

Data-driven choice of $k$. The knowledge of $k$ is assumed and is used in both the methodology
and the theory of the present paper. Data-driven choice of $k$ is of both practical importance and
contemporary research interest, and researchers have proposed various ways to achieve this goal
for the stochastic block model, including cross-validation [17], Tracy-Widom test [44], information
criterion [60], likelihood ratio test [68], etc. Whether these methods are optimal and whether it is
possible to select $k$ in a statistically optimal way remain important open problems.

More general models. The results in this paper cover a large range of parameter spaces for
stochastic block models and we show the competitive performance of the proposed algorithm both 
in theory and on numerical examples. Despite its popularity, stochastic block model has its own 
limits for modeling network data. Therefore, an important future research direction is to design 
computationally feasible algorithms that can achieve statistically optimal performance for more 
general network models, such as degree-corrected stochastic block models. 


7 Proofs of main results 

The main result of the paper, Theorem 2, is proved in Section 7.1. Theorem 3 and Theorem 4 are 
proved in Section 7.2 and Section 7.3 respectively. The proofs of the remaining results, together 
with some auxiliary lemmas, are given in the appendix. 


7.1 Proof of Theorem 2 


We first state a lemma that guarantees the accuracy of parameter estimation in Algorithm 1. 

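Lemma 1. Let $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak} \to \infty$ and Condition 1 holds
with $\gamma$ satisfying (16) and (18). Then there is a sequence $\eta \to 0$ as $n \to \infty$ and a constant $C > 0$
such that

$$\min_{u \in [n]} \inf_{(B,\sigma) \in \Theta} \mathbb{P}\left\{ \min_{\pi \in S_k} \max_{i,j \in [k]} \left| \hat B^u_{ij} - B_{\pi(i)\pi(j)} \right| \le \eta \, \frac{a-b}{n} \right\} \ge 1 - C n^{-(1+\delta)}. \qquad (32)$$

For $\Theta = \Theta_0(n, k, a, b, \beta)$, the conclusion (32) continues to hold even when the assumption (18) is
dropped.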

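Proof. 1° Let $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. For any community assignments $\sigma_1$ and $\sigma_2$, define

$$\ell_0(\sigma_1, \sigma_2) = \frac{1}{n} \sum_{u=1}^{n} \mathbf{1}\{\sigma_1(u) \ne \sigma_2(u)\}. \qquad (33)$$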
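As a concrete reading of (33), the loss $\ell_0$ is the fraction of nodes on which two labelings disagree, with no relabeling allowed. The sketch below also includes the permutation-minimized version $\min_{\pi \in S_k} \ell_0(\pi(\hat\sigma), \sigma)$ that appears later in Lemma 4; the function names are illustrative, and the brute-force search over all $k!$ permutations is only sensible for small $k$.

```python
from itertools import permutations


def ell0(sigma1, sigma2):
    """Fraction of nodes with different labels, as in (33)."""
    return sum(x != y for x, y in zip(sigma1, sigma2)) / len(sigma1)


def ell_min_over_permutations(estimate, truth, k):
    """Smallest ell0 over all relabelings of the estimate (labels in 0..k-1)."""
    return min(
        ell0([pi[label] for label in estimate], truth)
        for pi in permutations(range(k))
    )
```

For instance, the labelings `[1, 1, 0, 0]` and `[0, 0, 1, 1]` disagree everywhere under `ell0` but describe the same partition, so the permutation-minimized loss is zero.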
Fix any $(B, \sigma) \in \Theta$ and $u \in [n]$. Define the event

$$E_u = \left\{ \ell_0(\pi_u(\sigma), \sigma_u^0) \le \gamma \right\}. \qquad (34)$$

To simplify notation, assume that $\pi_u = \mathrm{Id}$ is the identity permutation.

Fix any i E [k\. On E u , 

rii > |Cf PI Ci\ >rn — 71 n, \Cf n Cf \ < 72 U,, where 71,72 > 0 and 71 + 72 < 7 . (35) 


Let C\ be any deterministic subset of [n] such that (35) holds with Cf replaced by C[. By definition, 
there are at most 


in / \ in 

e(?)e 


z=o 


m =0 


m 


< ( 7 n + l ) 2 ( ) ( — ) < exp \ 2 log( 7 n + 1 ) + 27 n log — 


772 


7 n 


( en\ 


7 n 


\l n J 






7 


< exp < Ci'yn log 


different subsets with this property where C\ > 0 is an absolute constant. Let £[ be the edges 
within C[. Then \£[\ consists of independent Bernoulli random variables, where at least (1 — 
Pjk) 2 proportion of them follow the Bern(i?jj) distribution, at most (/Ly k) 2 proportion that are 
stochastically smaller than Bern(^) and stochastically larger than Bern(^), and at most 2/3'yk 
proportion are stochastically smaller than Bern(^). Therefore, we obtain that 


(l-p'yk) 2 B ii + {P'yk) 2 -<E 

n 


131 


*131(131 "1) 


< max 


o 9 da b 

2 D 1 -h 2 1- 

n n 


(1 - t) 2 Ba + t 


(36) 


Note that the LHS is (1 — (2 + o(l))/ 37 fc)Bjj. On the other hand, under condition (18), the RHS is 
attained at t = 0 and equals Bu exactly. Thus, we conclude that 


E 


\£’\ 


*I3I( I3i-i) 


— Bi, 


< Cftyk™ = V ( — 
n \ n 


(37) 


for some rj 0 that depends only on a, k } a,/3 and 7 , where the last inequality is due to (18). 
On the other hand, by Bernstein’s inequality, for any t > 0, 


t 2 


Wl—E|3I|> ( }<2«P - 


Let 


, era, 


t 2 = (rii + 'yn) 2 — (Ciynlogy 1 + (3 + 5) logn) V ( 2 Ci 7 nlog 7 1 + 2(3 + 5) logn) 


n 


~(^ v/a 7 log7 1 + 7 n 1°S7 *) , 


where we the second inequality holds since is monotone decreasing as x increases and so 
7 logy -1 > 7 logn for any 7 > 7, which is the case of most interest since 7 < 7 leads to 7 = 0 
and so the initialization is already perfect. Even when 7 = 0, we can still continue to the following 
arguments by replacing every 7 with 7 and all the steps continue to hold. Thus, we obtain that 
for positive constant C a ^ t g that depends only on a, (3 and 5, 

I"{|I3I - E |£ll| > Ca,p,6 (”v /a 7 1 °g7 ” 1 + 7 U log 7 _1 ) | < exp {—Ciynlogy -1 } n^ (3+<5 \ (38) 
Thus, with probability at least $1 - \exp\{-C_1 \gamma n \log \gamma^{-1}\} \, n^{-(3+\delta)}$,


131 


||C'|(|C'| - 1 ) 


— E 


131 


*131(131-i) 


/ _ 

< Ca,0,5 — V a 7 l°g7 _1 + 


n 


W , fe 2 7log7 ] \ =r] '( a ^L 
n ) I n 


(39) 


where rf —> 0 depends only on a, k, a, (3 ,7 and 6. Here, the last inequality holds since 

ky /ay logy -1 = Vaky/k'y logy -1 , 

where \fak <C a — b since - > 00 and fcylogy ^ 1 = 0 ( 1 ), and 

( ^ \ 2 

k 2 y logy _1 = &y logy ~ l ■ k < k <^. - < a — b. 

a 

We combine (37) and (39) and apply the union bound to obtain that for a sequence rj —> 0 that 
depends only on a, k, a , ft, y and <5, with probability at least 1 — ri — ( 3+<5 ) 


13 * 


113 * 1 ( 13 * 1 - 1 ) 


— Br 



(40) 


The proof for the estimation of $B_{ij}$ with $i \ne j$ is analogous and hence is omitted. A final union bound on $i, j \in [k]$
leads to the desired claim since all the constants and vanishing sequences in the above analysis
depend only on $a$, $b$, $k$, $\alpha$, $\beta$, $\gamma$ and $\delta$, but not on $u$, $B$ or $\sigma$.

2° If $\Theta = \Theta_0(n, k, a, b, \beta)$, then condition (18) on $\gamma$ is no longer needed. This is because (36)
can be replaced by


min 

te[o,/37fc] 


(1 ~t) 2 ~ 
n 


, > ,61 

+ 2 f(l — t) -b t — ? 

n n J 


< E 



5131(131 -1) 


< max / (1 — t ) 2 —b t 2 —b 2 f(l 
ie[o,/ 37 fc] ( n n 



where the LHS equals ^ — (1 — /3yfe(l + o( l)))^p = ^ + o(^) and the RHS equals Thus, no 
additional condition is needed to guarantee (37) in the foregoing arguments. This completes the 
proof. □ 


The next two lemmas establish the desired error bound for the node-wise refinement. 

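Lemma 2. Let $\Theta_0$ be defined as in (2) and $k \ge 2$. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak} \to \infty$ and $a \asymp b$.
If there exist two sequences $\gamma = o(1/k)$ and $\eta = o(1)$, constants $C, \delta > 0$ and permutations
$\{\pi_u\}_{u=1}^{n} \subset S_k$ such that

$$\inf_{(B,\sigma) \in \Theta_0} \min_{u \in [n]} \mathbb{P}\left\{ \ell_0(\pi_u(\sigma), \sigma_u^0) \le \gamma, \ |\hat a_u - a| \le \eta(a-b), \ |\hat b_u - b| \le \eta(a-b) \right\} \ge 1 - C n^{-(1+\delta)}, \qquad (41)$$

then for $\hat\sigma_u(u)$ defined as in (10) with $\rho = \rho_u$ in (12), there is a sequence $\eta' = o(1)$ such that for
$k = 2$,

$$\sup_{(B,\sigma) \in \Theta_0} \max_{u \in [n]} \mathbb{P}\left\{ \hat\sigma_u(u) \ne \pi_u(\sigma(u)) \right\} \le (k-1) \exp\left\{ -(1-\eta') \frac{n I^*}{2} \right\} + C n^{-(1+\delta)},$$

and for $k \ge 3$,

$$\sup_{(B,\sigma) \in \Theta_0} \max_{u \in [n]} \mathbb{P}\left\{ \hat\sigma_u(u) \ne \pi_u(\sigma(u)) \right\} \le (k-1) \exp\left\{ -(1-\eta') \frac{n I^*}{\beta k} \right\} + C n^{-(1+\delta)}.$$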
(B,o-)e 0 o ue W l J 
Proof. In what follows, let $E_u$ denote the event in (41). For the sake of brevity, we let $p = a/n$,
$q = b/n$, $p_u = \hat a_u/n$ and $q_u = \hat b_u/n$. Moreover, let $\sigma_u = \pi_u(\sigma)$, $n_i = |\{v : \sigma_u(v) = i\}|$, $m_i = |\{v :
\sigma_u^0(v) = i\}|$ and $m_i' = |\{v : \sigma_u^0(v) = \sigma_u(v) = i\}|$. Without loss of generality, let $\sigma_u(u) = 1$.

Then we have 


(ju ( u) -f— 1 and Eu } E IP \ Eu and E -A-uv E Auv E Pu(mi n^i) / — ^ pi- (42) 


¥ i 


T u (v)=l 


cr u (v )=1 


1^1 


Now we bound each pi. By the independence structure and Chernoff bound, we have 
Pi < e| exp (-t u p u (mi - mi)) ( qe tu + 1 - q) m i(pe tu + 1 - p) mi ~ m i 


xE 


qe tu + 1 — q 


pe 


tni 


+ 1 —p 


l { em } 

( 43 ) 

>e- tu + l-p) mi 1{Eu }} 

( 44 ) 

\ mi-m\ \ 


) 1 {Eu} | • 

( 45 ) 


We are going to give bounds for the terms in (44) and (45) respectively. Before doing that, we need 
some preparatory inequalities. Define t* through the equation 


e 


t* 


P( 1 ~ g) 
9(1 -P)' 


Then, on the event E u , 


Ju-t* I 


+ e 


< exp(C'i? 7 ), 


for some constant C\ > 0. Moreover, 


\e tu — II V \e~ tu - II < C 2 P -^ = C 2 a 


p a 

for some constant C 2 > 0. Therefore, for the term in (44), on the event E u , 

exp (- t u p u {mi - mi)) ( qe tu + 1 - q) mi (pe~ tu + 1 - p) mi 
= exp (- t u p u (mi - mi)) ( qe tu + 1 - q ) (m * mi)/2 (pe~ tu + 1 - p) (mi m ' )/2 

x ( qe tu + l _ q ^ +mi )/2 ^ oe _ tu + 1 _ ^(nu+mO/2 

By (46), the term in (49) is upper bounded by 


(pq + (1 - P)(l - 9) + VP9\/ (1 - p)(l - 9)(e*“ r +e r *“)) 


2 


< exp -(1 + o(l)) 


mi + rrq 


< exp -(1 + o(l)) 


ni + m 


r 


(46) 


(47) 


(48) 

(49) 


21 













By (47), the term in (48) is upper bounded by 


exp (-tuPuimi - mi)) (qe tu + 1 - q) inH ™ l)/2 (pe tu + 1 - p) 
„-t u + i - p p u e~ tu + 1 -p u 




< 


exp 

exp 


777-1 — m l 
2 

| mi — mi 


log 


pe 


- log : 


qe tu + l-q q u e tu + 1 -q u 

e~ tu - l\\p u -p\ + \e tu - l\\q u - q\) 


/ , , n (p — q ) 2 

exp ( o(l)— 


Therefore, we can upper bound (44) as 


E i)( ge t» + i _ q ) m i(pe + l-p) mi l {Eu} } < exp ^-(1 + 


Now we provide an upper bound for (45). By (47), on E u , 

/( yP-Q ) 2 


pe tu + 1 - p = 1 + (p-g)(e tu - 1 ) < ^ + Q 


qe tu + 1 — q 


and 


qe tu + 1 - q 
pe~ tu + 1 — p 


= 1 + 


qe tu + 1 — q 

(p-q)( l~e~ tu ) 
pe~ tu + 1 — p 


p 


<1 + 0 


(p - qf 

p 


< exp O 


< exp ( O 


om) n -++r 


(p - qf 


(.p - qf 

p 


Therefore, 


E ■ 


By combining (50) and (51), we have 


z I /-i . /-i \ \ 77 1 4“ 77/ . 

Pi < exp -(1 + o(l))—-—/ 


Using (42), this implies 


and so 


’ {a u (u) 1 and E u } < (k — 1) exp ( —(1 + o(l)) min 

+1 


n\ + ni 


’{< 7 u (it) 1 } < (k - 1 ) exp ( -(1 + o(l)) min 

+1 


77i + ni 


J* ] + Cn~ {1+5) . 


(50) 


■ <«> 


(52) 


When $k = 2$, $\min_{l \ne 1} \frac{n_1 + n_l}{2} = \frac{n}{2}$, and when $k \ge 3$, $\min_{l \ne 1} \frac{n_1 + n_l}{2} \ge \frac{n}{\beta k}$. Thus, the proof is
complete. □


Lemma 3. Let $\Theta$ be defined as in (3) and $k \ge 2$. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak} \to \infty$ and $a \asymp b$.
If there exist two sequences $\gamma = o\left(\frac{a-b}{ak}\right)$ and $\eta = o(1)$, constants $C, \delta > 0$ and permutations
$\{\pi_u\}_{u=1}^{n} \subset S_k$ such that (41) holds, then for $\hat\sigma_u(u)$ defined as in (10) with $\rho = \rho_u$ in (12), the
conclusions of Lemma 2 continue to hold.
Proof. The proof is similar to that of Lemma 2 and we use the same notation as there. First, we
give a bound for $p_l$ defined in (42). Let $X_j \sim \mathrm{Bern}(q)$, $Y_j \sim \mathrm{Bern}(p)$ and $Z_j \sim \mathrm{Bern}(\alpha p)$, $j \ge 1$, be
mutually independent. Then, a stochastic order argument gives


mi—m j 


pi < E P LE>+ E Z i-Y Yj > p(mi — m±) and E U \A_ Z 
j =i j =i i=l 

< E{exp(-t u p u (m; - mi)) (ge*“ + 1 - q) mi (jpe~ tu + 1 -p) mi l {Eu} } 


(53) 


xE 


qe tu + 1 — q 


mi—m J 


pe 


tri 


1-P 


m\—m\ 


{ape tu + ! _ .(54) 


Note that the term in (53) is the same as that in (44), and thus it can be upper bounded by (50) 
as before. To bound for (54), observe that by (47), 


1 


qe tu + 1 — q 


1 


pe tu + 1 — p 


< exp (q\e tu — 1 |) < exp (■ 0(p — q )), 

< exp (Cp\e~ tu — 1|) < exp (0(p — q)) 


and 


ape tu + 1 — ap < exp (ap\e tu — 1 |) < exp (0(p — q)). 

Thus, under the assumption 7 = the term (54) is bounded by exp (o(l) ni ^ rti /*). The 

remaining proof is the same as that of Lemma 2. □ 


Finally, we need a lemma to justify the consensus step in Algorithm 1. 

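Lemma 4. For any community assignments $\sigma$ and $\sigma' : [n] \to [k]$, such that for some constant
$C \ge 1$,

$$\min_{l \in [k]} |\{u : \sigma(u) = l\}|, \ \min_{l \in [k]} |\{u : \sigma'(u) = l\}| \ge \frac{n}{Ck}, \quad \text{and} \quad \min_{\pi \in S_k} \ell_0(\sigma, \pi(\sigma')) < \frac{1}{Ck},$$

define the map $\xi : [k] \to [k]$ as

$$\xi(i) = \mathop{\rm argmax}_{l \in [k]} \left| \{u : \sigma(u) = l\} \cap \{u : \sigma'(u) = i\} \right|, \quad \forall i \in [k]. \qquad (55)$$

Then $\xi \in S_k$ and $\ell_0(\sigma, \xi(\sigma')) = \min_{\pi \in S_k} \ell_0(\sigma, \pi(\sigma'))$.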

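Proof. By the definition in (55), we obtain

$$\xi = \mathop{\rm argmin}_{\xi' : [k] \to [k]} \ell_0(\sigma, \xi'(\sigma')), \quad \text{and} \quad \ell_0(\sigma, \xi(\sigma')) \le \min_{\pi \in S_k} \ell_0(\sigma, \pi(\sigma')) < \frac{1}{Ck}.$$

Thus, what remains to be shown is that $\xi \in S_k$, i.e., $\xi(l_1) \ne \xi(l_2)$ for any $l_1 \ne l_2$. To this end, note
that if for some $l_1 \ne l_2$, $\xi(l_1) = \xi(l_2)$, then there would exist some $l_0 \in [k]$ such that for any $l \in [k]$,
$\xi(l) \ne l_0$, and so

$$\ell_0(\sigma, \xi(\sigma')) \ge \frac{1}{n} \sum_{u : \sigma(u) = l_0} 1 = \frac{|\{u : \sigma(u) = l_0\}|}{n} \ge \frac{1}{Ck}.$$

This is in contradiction to the second last display, and hence $\xi \in S_k$. This completes the proof. □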
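The map in (55) can be computed from a $k \times k$ table of overlaps between the two labelings. The following sketch uses labels in $0, \dots, k-1$ rather than $1, \dots, k$, and the function name `consensus_map` is illustrative; ties in the argmax are broken by the smallest label, a choice that (55) leaves unspecified.

```python
def consensus_map(sigma, sigma_prime, k):
    """Return xi as a list, where xi[i] is the label l maximizing
    |{u: sigma(u) = l} & {u: sigma_prime(u) = i}|, as in (55)."""
    # overlap[i][l] counts nodes labeled i by sigma_prime and l by sigma.
    overlap = [[0] * k for _ in range(k)]
    for label, label_prime in zip(sigma, sigma_prime):
        overlap[label_prime][label] += 1
    return [max(range(k), key=lambda l: overlap[i][l]) for i in range(k)]


sigma = [0, 0, 0, 1, 1, 1, 2, 2, 2]
sigma_prime = [2, 2, 2, 0, 0, 1, 1, 1, 1]   # relabeled copy of sigma with one error
xi = consensus_map(sigma, sigma_prime, 3)
aligned = [xi[label] for label in sigma_prime]
```

In this example the two labelings are close up to a relabeling, so `xi` is a permutation and `aligned` differs from `sigma` at a single node, in line with the conclusion of Lemma 4.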
Proof of Theorem 2. Let $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$, and fix any $(B, \sigma) \in \Theta$. For any $u \in [n]$,
by Condition 1 and the fact that $\sigma_u^0$ and $\hat\sigma_u$ differ only at the community assignment of $u$, for
$\gamma' = \gamma + 1/n$, there exists some $\pi_u \in S_k$ such that

$$\mathbb{P}\left\{ \ell_0(\sigma, \pi_u^{-1}(\hat\sigma_u)) \le \gamma' \right\} \ge 1 - C_0 n^{-(1+\delta)}. \qquad (56)$$

Without loss of generality, we assume $\pi_1 = \mathrm{Id}$ is the identity map. Now for any fixed $u \in \{2, \dots, n\}$,
define the map $\xi_u : [k] \to [k]$ as in (55) with $\sigma$ and $\sigma'$ replaced by $\hat\sigma_1$ and $\hat\sigma_u$. Then by definition

$$\hat\sigma(u) = \xi_u(\hat\sigma_u(u)). \qquad (57)$$

In addition, (56) implies that with probability at least $1 - C n^{-(1+\delta)}$, we have

$$\ell_0(\sigma, \hat\sigma_1) \le \gamma' \quad \text{and} \quad \ell_0(\sigma, \pi_u^{-1}(\hat\sigma_u)) \le \gamma'.$$

So the triangle inequality implies $\ell_0(\hat\sigma_1, \pi_u^{-1}(\hat\sigma_u)) \le 2\gamma'$ and hence the condition of Lemma 4 is
satisfied. Thus, Lemma 4 implies

$$\mathbb{P}\left\{ \xi_u = \pi_u^{-1} \right\} \ge 1 - C n^{-(1+\delta)}. \qquad (58)$$

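When $k \ge 3$, Lemma 1, (16) and (18) imply that the condition of Lemma 3 is satisfied, which in turn implies that for a sequence $\eta' = o(1)$,

$$\begin{aligned}
\mathbb{P}\{\hat\sigma(u) \ne \sigma(u)\} &= \mathbb{P}\{\xi_u(\hat\sigma_u(u)) \ne \sigma(u)\} \\
&\le \mathbb{P}\{\xi_u(\hat\sigma_u(u)) \ne \sigma(u),\ \xi_u = \pi_u^{-1}\} + \mathbb{P}\{\xi_u \ne \pi_u^{-1}\} \\
&\le \mathbb{P}\{\hat\sigma_u(u) \ne \pi_u(\sigma(u))\} + \mathbb{P}\{\xi_u \ne \pi_u^{-1}\} \\
&\le Cn^{-(1+\delta)} + (k-1)\exp\left\{-(1-\eta')\frac{nI^*}{\beta k}\right\}.
\end{aligned}$$

Set

$$\eta = \eta' + \beta\sqrt{\frac{k}{nI^*}} = o(1), \tag{59}$$

where the last equality holds since $\frac{(a-b)^2}{ak} \to \infty$. Thus, Markov's inequality leads to

$$\begin{aligned}
&\mathbb{P}\left\{\ell_0(\sigma, \hat\sigma) > (k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}\right\} \\
&\quad\le \frac{1}{(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}} \cdot \frac{1}{n}\sum_{u=1}^n \mathbb{P}\{\hat\sigma(u) \ne \sigma(u)\} \\
&\quad\le \exp\left\{-(\eta-\eta')\frac{nI^*}{\beta k}\right\} + \frac{Cn^{-(1+\delta)}}{(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}} \\
&\quad= \exp\left\{-\sqrt{\frac{nI^*}{k}}\right\} + \frac{Cn^{-(1+\delta)}}{(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}}.
\end{aligned}$$

If $(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\} \ge n^{-(1+\delta/2)}$, then

$$\mathbb{P}\left\{\ell_0(\sigma, \hat\sigma) > (k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}\right\} \le \exp\left\{-\sqrt{\frac{nI^*}{k}}\right\} + Cn^{-\delta/2} = o(1).$$

If $(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\} < n^{-(1+\delta/2)}$, then

$$\begin{aligned}
\mathbb{P}\left\{\ell_0(\sigma, \hat\sigma) > (k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}\right\} &= \mathbb{P}\{\ell_0(\sigma, \hat\sigma) > 0\} \le \sum_{u=1}^n \mathbb{P}\{\hat\sigma(u) \ne \sigma(u)\} \\
&\le n(k-1)\exp\left\{-(1-\eta')\frac{nI^*}{\beta k}\right\} + Cn^{-\delta} \le Cn^{-\delta/2} = o(1).
\end{aligned}$$

Here, the second last inequality holds since $\eta > \eta'$ and so $(k-1)\exp\{-(1-\eta')nI^*/(\beta k)\} \le (k-1)\exp\{-(1-\eta)nI^*/(\beta k)\} < n^{-(1+\delta/2)}$. We complete the proof for the case of $\Theta(n, k, a, b, \lambda, \beta; \alpha)$ and $k \ge 3$ by noting that $(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\} = \exp\left\{-(1-\eta'')\frac{nI^*}{\beta k}\right\}$ for another sequence $\eta'' = o(1)$ under the assumption $\frac{(a-b)^2}{ak\log k} \to \infty$, and that no constant or sequence in the foregoing arguments involves $B$, $\sigma$ or $u$. When $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$ and $k = 2$, the foregoing arguments continue to hold with $\beta$ and $k$ replaced with 1 and 2 respectively.

When $\Theta = \Theta_0(n, k, a, b, \beta)$, we can run the foregoing arguments with Lemma 3 replaced by Lemma 2 to reach the conclusion in (17), which does not require condition (18). This completes the proof. □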


7.2 Proof of Theorem 3 

The following lemma is critical to establish the result of Theorem 3. Its proof is given in the appendix. Let us introduce the notation $O(k_1, k_2) = \{V \in \mathbb{R}^{k_1 \times k_2} : V^TV = I_{k_2}\}$ for $k_1 \ge k_2$.

Lemma 5. Consider a symmetric adjacency matrix $A \in \{0,1\}^{n\times n}$ and a symmetric matrix $P \in [0,1]^{n\times n}$ satisfying $A_{uu} = 0$ for all $u \in [n]$ and $A_{uv} \sim \text{Bernoulli}(P_{uv})$ independently for all $u > v$. For any $C' > 0$, there exists some $C > 0$ such that

$$\|T_\tau(A) - P\|_{\rm op} \le C\sqrt{np_{\max} + 1},$$

with probability at least $1 - n^{-C'}$ uniformly over $\tau \in [C_1(np_{\max} + 1), C_2(np_{\max} + 1)]$ for some sufficiently large constants $C_1, C_2$, where $p_{\max} = \max_{u \ge v} P_{uv}$.

Lemma 6. For $P = (P_{uv}) = (B_{\sigma(u)\sigma(v)})$, we have the SVD $P = U\Lambda U^T$, where

$$U = Z\Delta^{-1}W,$$

with $\Delta = {\rm diag}(\sqrt{n_1}, \dots, \sqrt{n_k})$, $Z \in \{0,1\}^{n\times k}$ a matrix with exactly one nonzero entry in each row at $(i, \sigma(i))$ taking value 1, and $W \in O(k,k)$.

Proof. Note that

$$P = ZBZ^T = Z\Delta^{-1}\Delta B\Delta(Z\Delta^{-1})^T,$$

and observe that $Z\Delta^{-1} \in O(n,k)$. Apply SVD to the matrix $\Delta B\Delta = W\Lambda W^T$ for some $W \in O(k,k)$, and then we have $P = U\Lambda U^T$ with $U = Z\Delta^{-1}W \in O(n,k)$. □
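The factorization used in the proof of Lemma 6 can be checked numerically on a toy example. The sketch below is illustrative only: the assignment `sigma`, the matrix `B`, and the helpers `matmul` and `transpose` are assumptions made for the example. It verifies that $Z\Delta^{-1}$ has orthonormal columns and that $ZBZ^T$ equals $(Z\Delta^{-1})(\Delta B\Delta)(Z\Delta^{-1})^T$.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

sigma = [0, 0, 0, 1, 1]            # toy assignment with n = 5, k = 2
B = [[0.6, 0.1], [0.1, 0.5]]       # toy connectivity matrix
n, k = len(sigma), len(B)
sizes = [sigma.count(i) for i in range(k)]

# Membership matrix Z and its column-normalized version Z Delta^{-1}.
Z = [[1.0 if sigma[u] == i else 0.0 for i in range(k)] for u in range(n)]
Q = [[Z[u][i] / math.sqrt(sizes[i]) for i in range(k)] for u in range(n)]
# The k x k core matrix Delta B Delta.
core = [[math.sqrt(sizes[i]) * B[i][j] * math.sqrt(sizes[j]) for j in range(k)]
        for i in range(k)]

P = matmul(matmul(Z, B), transpose(Z))
P_factored = matmul(matmul(Q, core), transpose(Q))
gram = matmul(transpose(Q), Q)     # expected to be the k x k identity
```

Diagonalizing the small matrix `core` then yields the eigenvectors of the $n \times n$ matrix $P$, which is the content of the lemma.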

Proof of Theorem 3. Under the current assumption, $\mathbb{E}\tau \in [C_1'a, C_2'a]$ for some large $C_1'$ and $C_2'$. Using Bernstein's inequality, we have $\tau \in [C_1a, C_2a]$ for some large $C_1$ and $C_2$ with probability at least $1 - e^{-C'n}$. When (20) holds, by Lemma 5, we deduce that the $k$-th eigenvalue of $T_\tau(A)$ is lower bounded by $c_1\lambda_k$ with probability at least $1 - n^{-C'}$ for some small constant $c_1 \in (0,1)$.

Figure 6: The schematic plot for the proof of Theorem 3. The balls $\{T_i\}_{i\in[k]}$ are centered at $\{Q_{i*}\}_{i\in[k]}$, and the centers are at least $\sqrt{\frac{2k}{\beta n}}$ away from each other. The balls $\{\hat C_i\}_{i\in[k]}$ intersect with large proportions of $\{T_i\}_{i\in[k]}$, and their subscripts do not need to match due to some permutation.

By Davis-Kahan's sin-theta theorem [22], we have $\|\hat U - UW_1\|_F \le C\frac{\sqrt{k}}{\lambda_k}\|T_\tau(A) - P\|_{\rm op}$ for some $W_1 \in O(k,k)$ and some constant $C > 0$. Applying Lemma 6, we have

$$\|\hat U - V\|_F \le C\frac{\sqrt{k}}{\lambda_k}\|T_\tau(A) - P\|_{\rm op}, \tag{60}$$

where $V = Z\Delta^{-1}W_2 \in O(n,k)$ for some $W_2 \in O(k,k)$. Combining (60), Lemma 5 and the conclusion $\tau \in [C_1a, C_2a]$, we have

$$\|\hat U - V\|_F \le \frac{C\sqrt{k}\sqrt{a}}{\lambda_k}, \tag{61}$$

with probability at least $1 - n^{-C'}$. The definition of $V$ implies that

$$\|V_{u*} - V_{v*}\| = \sqrt{\frac{1}{n_{\sigma(u)}} + \frac{1}{n_{\sigma(v)}}} \text{ when } \sigma(u) \ne \sigma(v), \quad\text{and}\quad V_{u*} = V_{v*} \text{ when } \sigma(u) = \sigma(v). \tag{62}$$

In other words, define $Q = \Delta^{-1}W_2 \in \mathbb{R}^{k\times k}$ and we have $V_{u*} = Q_{\sigma(u)*}$ for each $u \in [n]$. Hence, for $\sigma(u) \ne \sigma(v)$, $\|Q_{\sigma(u)*} - Q_{\sigma(v)*}\| = \|V_{u*} - V_{v*}\| \ge \sqrt{\frac{2k}{\beta n}}$. Recall the definition $r = \mu\sqrt{\frac{k}{n}}$ in Algorithm 2. Define the sets


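$$T_i = \left\{u \in \sigma^{-1}(i) : \|\hat U_{u*} - Q_{i*}\| < \frac{r}{2}\right\}, \quad i \in [k].$$

By definition, $T_i \cap T_j = \emptyset$ when $i \ne j$, and we also have

$$\cup_{i\in[k]} T_i = \left\{u \in [n] : \|\hat U_{u*} - V_{u*}\| < \frac{r}{2}\right\}. \tag{63}$$

Therefore

$$\left|(\cup_{i\in[k]} T_i)^c\right| \frac{r^2}{4} \le \sum_{u\in[n]} \|\hat U_{u*} - V_{u*}\|^2 \le \frac{C^2ka}{\lambda_k^2},$$

where the last inequality is by (61). After rearrangement, we have

$$\left|(\cup_{i\in[k]} T_i)^c\right| \le \frac{4C^2na}{\mu^2\lambda_k^2}. \tag{64}$$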


In other words, most nodes are close to the centers and are in the set (63). Note that the sets $\{T_i\}_{i\in[k]}$ are disjoint. Suppose there is some $i \in [k]$ such that $|T_i| < |\sigma^{-1}(i)| - |(\cup_{i\in[k]}T_i)^c|$. Then we have $|\cup_{i\in[k]}T_i| = \sum_{i\in[k]}|T_i| < n - |(\cup_{i\in[k]}T_i)^c| = |\cup_{i\in[k]}T_i|$, which is impossible. Thus, the cardinality of $T_i$ for each $i \in [k]$ is lower bounded as

$$|T_i| \ge |\sigma^{-1}(i)| - \left|(\cup_{i\in[k]}T_i)^c\right| \ge \frac{n}{\beta k} - \frac{4C^2na}{\mu^2\lambda_k^2} > \frac{n}{2\beta k}, \tag{65}$$


where the last inequality above is by the assumption (20). Intuitively speaking, except for a negligible proportion, most data points in $\{\hat U_{u*}\}_{u\in[n]}$ are very close to the population centers $\{Q_{i*}\}_{i\in[k]}$. Since the centers are at least $\sqrt{\frac{2k}{\beta n}}$ away from each other and $\{T_i\}_{i\in[k]}$ and $\{\hat C_i\}_{i\in[k]}$ are both defined through the critical radius $r = \mu\sqrt{\frac{k}{n}}$ for a small $\mu$, each $\hat C_i$ should intersect with only one $T_i$ (see Figure 6). We claim that there exists some permutation $\pi$ of the set $[k]$, such that for $\hat C_i$ defined in Algorithm 2,

$$\hat C_i \cap T_{\pi(i)} \ne \emptyset \quad\text{and}\quad |\hat C_i| \ge |T_{\pi(i)}| \quad\text{for each } i \in [k]. \tag{66}$$

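In what follows, we first establish the result of Theorem 3 by assuming (66). The proof of (66) will be given in the end. Note that for any $i \ne j$, $T_{\pi(i)} \cap \hat C_j = \emptyset$, which is deduced from the fact that $\hat C_j \cap T_{\pi(j)} \ne \emptyset$ and the definition of $\hat C_j$. Therefore, $T_{\pi(i)} \subset \hat C_j^c$ for all $j \ne i$. Combining with the fact that $T_{\pi(i)} \cap \hat C_i^c \subset \hat C_i^c$, we get $T_{\pi(i)} \cap \hat C_i^c \subset (\cup_{i\in[k]}\hat C_i)^c$. Therefore,

$$\cup_{i\in[k]} \left(T_{\pi(i)} \cap \hat C_i^c\right) \subset \left(\cup_{i\in[k]} \hat C_i\right)^c. \tag{67}$$

Since $T_i \cap T_j = \emptyset$ for $i \ne j$, we deduce from (67) that

$$\sum_{i\in[k]} \left|T_{\pi(i)} \cap \hat C_i^c\right| \le \left|\left(\cup_{i\in[k]} \hat C_i\right)^c\right|. \tag{68}$$

By definition, $\hat C_i \cap \hat C_j = \emptyset$ for $i \ne j$, and we deduce from (66) that

$$\left|\left(\cup_{i\in[k]} \hat C_i\right)^c\right| = n - \sum_{i\in[k]} |\hat C_i| \le n - \sum_{i\in[k]} |T_i| = \left|(\cup_{i\in[k]} T_i)^c\right|. \tag{69}$$

Combining (68), (69) and (64), we have

$$\sum_{i\in[k]} \left|T_{\pi(i)} \cap \hat C_i^c\right| \le \frac{4C^2na}{\mu^2\lambda_k^2}. \tag{70}$$

Since for any $u \in \cup_{i\in[k]}(\hat C_i \cap T_{\pi(i)})$, we have $\hat\sigma(u) = i$ when $\sigma(u) = \pi(i)$, the misclassification rate is bounded as

$$\begin{aligned}
\ell_0(\hat\sigma, \pi^{-1}(\sigma)) &\le \frac{1}{n}\left|\left(\cup_{i\in[k]}(\hat C_i \cap T_{\pi(i)})\right)^c\right| \\
&\le \frac{1}{n}\left(\left|\left(\cup_{i\in[k]}(\hat C_i \cap T_{\pi(i)})\right)^c \cap \left(\cup_{i\in[k]}T_i\right)\right| + \left|(\cup_{i\in[k]}T_i)^c\right|\right) \\
&\le \frac{1}{n}\left(\sum_{i\in[k]} \left|T_{\pi(i)} \cap \hat C_i^c\right| + \left|(\cup_{i\in[k]}T_i)^c\right|\right) \\
&\le \frac{8C^2a}{\mu^2\lambda_k^2},
\end{aligned}$$

where the last inequality is from (70) and (64). This proves the desired conclusion.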

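Finally, we are going to establish the claim (66) to close the proof. We use mathematical induction. For $i = 1$, it is clear that $|\hat C_1| \ge \max_{i\in[k]}|T_i|$ holds by the definition of $\hat C_1$. Suppose $\hat C_1 \cap T_i = \emptyset$ for all $i \in [k]$, and then we must have

$$\left|(\cup_{i\in[k]}T_i)^c\right| \ge |\hat C_1| \ge \max_{i\in[k]}|T_i| \ge \frac{n}{2\beta k},$$

where the last inequality is by (65). This contradicts (64) under the assumption (20). Therefore, there must be a $\pi(1)$ such that $\hat C_1 \cap T_{\pi(1)} \ne \emptyset$ and $|\hat C_1| \ge |T_{\pi(1)}|$. Moreover,

$$\begin{aligned}
\left|\hat C_1^c \cap T_{\pi(1)}\right| &= |T_{\pi(1)}| - |T_{\pi(1)} \cap \hat C_1| \\
&\le |\hat C_1| - |T_{\pi(1)} \cap \hat C_1| \\
&= \left|\hat C_1 \cap T_{\pi(1)}^c\right| \\
&\le \left|(\cup_{i\in[k]}T_i)^c\right|,
\end{aligned}$$

where the last inequality is because $T_{\pi(1)}$ is the only set in $\{T_i\}_{i\in[k]}$ that intersects $\hat C_1$ by the definitions. By (64), we get

$$\left|\hat C_i^c \cap T_{\pi(i)}\right| \le \frac{4C^2na}{\mu^2\lambda_k^2} \tag{71}$$

for $i = 1$.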

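Now suppose (66) and (71) are true for $i = 1, \dots, l-1$. Because of the sizes of $\{\hat C_i\}_{i\in[l-1]}$ and the fact that $\{T_i\}_{i\in[k]}$ are mutually exclusive, we have

$$\left(\cup_{i=1}^{l-1} \hat C_i\right) \cap \left(\cup_{i\in[k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}} T_i\right) = \emptyset.$$

Therefore, for the set $S$ in the current step, $\cup_{i\in[k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}} T_i \subset S$. By the definition of $\hat C_l$, we have $|\hat C_l| \ge \max_{i\in[k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}} |T_i| \ge \frac{n}{2\beta k}$. Suppose $\hat C_l \cap T_{\pi(i)} \ne \emptyset$ for some $i = 1, \dots, l-1$. Then, this $T_{\pi(i)}$ is the only set in $\{T_i\}_{i\in[k]}$ that intersects $\hat C_l$ by their definitions. This implies that

$$|\hat C_l| \le \left|\hat C_l \cap T_{\pi(i)}\right| + \left|(\cup_{i\in[k]}T_i)^c\right|.$$

Since $\hat C_l \cap \hat C_i = \emptyset$, $|\hat C_l \cap T_{\pi(i)}| \le |\hat C_i^c \cap T_{\pi(i)}|$ is bounded by (71). Together with (64), we have

$$|\hat C_l| \le \frac{8C^2na}{\mu^2\lambda_k^2},$$

which contradicts $|\hat C_l| \ge \frac{n}{2\beta k}$ under the assumption (20). Therefore, we must have $\hat C_l \cap T_{\pi(i)} = \emptyset$ for all $i = 1, \dots, l-1$. Now suppose $\hat C_l \cap T_i = \emptyset$ for all $i \in [k]$. Then we must have

$$\left|(\cup_{i\in[k]}T_i)^c\right| \ge |\hat C_l| \ge \frac{n}{2\beta k},$$

which contradicts (64). Hence, $\hat C_l \cap T_{\pi(l)} \ne \emptyset$ for some $\pi(l) \in [k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}$, and (66) is established for $i = l$. Moreover, (71) can also be established for $i = l$ by the same argument that is used to prove (71) for $i = 1$. The proof is complete. □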
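The induction above relies on how the sets $\hat C_i$ are built greedily from the rows of $\hat U$. Algorithm 2 itself is not reproduced in this section, so the following is only a sketch of a greedy ball-clustering step of that kind: the function name, the tie-breaking rule, and the treatment of leftover points (assignment to the cluster with the smallest average distance) are assumptions made for illustration.

```python
import math

def greedy_ball_clustering(points, k, r):
    """Greedily pick up to k balls of radius r, each covering the most unassigned points."""
    remaining = set(range(len(points)))
    labels = [None] * len(points)
    clusters = []
    for i in range(k):
        if not remaining:
            break

        def ball(u):
            # Unassigned points strictly within distance r of point u.
            return {v for v in remaining if math.dist(points[u], points[v]) < r}

        # Center with the largest ball; ties go to the smallest index (assumption).
        center = max(sorted(remaining), key=lambda u: len(ball(u)))
        members = ball(center)
        for v in members:
            labels[v] = i
        clusters.append(members)
        remaining -= members
    # Leftover points: smallest average distance to a cluster (assumption).
    for u in remaining:
        labels[u] = min(
            range(len(clusters)),
            key=lambda i: sum(math.dist(points[u], points[v]) for v in clusters[i])
            / len(clusters[i]),
        )
    return labels

# Toy rows: two tight groups and one stray point.
rows = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (2.4, 2.4)]
toy_labels = greedy_ball_clustering(rows, k=2, r=1.0)
```

Because each ball is chosen to cover as many unassigned points as possible, the first ball is at least as large as any $T_i$, which is the property used at the start of the induction.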


7.3 Proof of Theorem 4 

Define $P_\tau = P + \frac{\tau}{n}\mathbf{1}\mathbf{1}^T$. The proof of the following lemma is given in the appendix.

Lemma 7. Consider a symmetric adjacency matrix $A \in \{0,1\}^{n\times n}$ and a symmetric matrix $P \in [0,1]^{n\times n}$ satisfying $A_{uu} = 0$ for all $u \in [n]$ and $A_{uv} \sim \text{Bernoulli}(P_{uv})$ independently for all $u > v$. For any $C' > 0$, there exists some $C > 0$ such that

$$\|L(A_\tau) - L(P_\tau)\|_{\rm op} \le C\sqrt{\frac{\log(e(np_{\max} + 1))}{np_{\max} + 1}},$$

with probability at least $1 - n^{-C'}$ uniformly over $\tau \in [C_1(np_{\max} + 1), C_2(np_{\max} + 1)]$ for some sufficiently large constants $C_1, C_2$, where $p_{\max} = \max_{u \ge v} P_{uv}$.

Lemma 8. Consider $P = (P_{uv}) = (B_{\sigma(u)\sigma(v)})$. Let the SVD of the matrix $L(P_\tau)$ be $L(P_\tau) = U\Sigma U^T$, with $U \in O(n,k)$ and $\Sigma = {\rm diag}(\sigma_1, \dots, \sigma_k)$. For $V = UW$ with any $W \in O(k,k)$, we have $\|V_{u*} - V_{v*}\| = \sqrt{\frac{1}{n_{\sigma(u)}} + \frac{1}{n_{\sigma(v)}}}$ when $\sigma(u) \ne \sigma(v)$ and $V_{u*} = V_{v*}$ when $\sigma(u) = \sigma(v)$. Moreover, $\sigma_k \ge \frac{\lambda_k}{2\tau}$ as long as $\tau \ge np_{\max}$.

Proof. The first part is Lemma 1 in [39]. Define $\bar d_v = \sum_{u\in[n]} P_{uv}$ and $\bar D_\tau = {\rm diag}(\bar d_1 + \tau, \dots, \bar d_n + \tau)$. Then, we have $L(P_\tau) = \bar D_\tau^{-1/2} P_\tau \bar D_\tau^{-1/2}$. Note that $P_\tau$ has an SBM structure so that it has rank at most $k$, and the $k$-th eigenvalue of $P_\tau$ is lower bounded by $\lambda_k$. Thus, we have

$$\sigma_k \ge \frac{\lambda_k}{\max_{u\in[n]} \bar d_u + \tau}.$$

Observe that $\max_{u\in[n]} \bar d_u \le np_{\max} \le \tau$, and the proof is complete. □
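For concreteness, the regularized normalized matrix $L(A_\tau) = D_\tau^{-1/2} A_\tau D_\tau^{-1/2}$, with $A_\tau = A + \frac{\tau}{n}\mathbf{1}\mathbf{1}^T$ and $D_\tau = {\rm diag}(d_1 + \tau, \dots, d_n + \tau)$, can be formed as in the sketch below. This is a plain-Python illustration under those definitions; the function name and the toy graph are assumptions.

```python
import math

def regularized_laplacian(A, tau):
    """Return D_tau^{-1/2} (A + (tau/n) 11^T) D_tau^{-1/2} as a list of rows."""
    n = len(A)
    # Row sums of A + (tau/n) 11^T are d_u + tau, i.e. the diagonal of D_tau.
    d_tau = [sum(A[u]) + tau for u in range(n)]
    return [[(A[u][v] + tau / n) / math.sqrt(d_tau[u] * d_tau[v]) for v in range(n)]
            for u in range(n)]

# Toy path graph 0 - 1 - 2 with regularization tau = 1.5.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
tau = 1.5
L = regularized_laplacian(A, tau)
```

A quick sanity check of the construction: the vector with entries $\sqrt{d_u + \tau}$ is an eigenvector of $L(A_\tau)$ with eigenvalue 1, since the row sums of $A_\tau$ are exactly $d_u + \tau$.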

Proof of Theorem 4. As is shown in the proof of Theorem 3, $\tau \in [C_1a, C_2a]$ for some large $C_1, C_2$ with probability at least $1 - e^{-C'n}$. By Davis-Kahan's sin-theta theorem [22], we have $\|\hat U - UW\|_F \le C_1\frac{\sqrt{k}}{\sigma_k}\|L(A_\tau) - L(P_\tau)\|_{\rm op}$ for some $W \in O(k,k)$ and some constant $C_1 > 0$. Let $V = UW$. Applying Lemma 7 and Lemma 8, we have

$$\|\hat U - V\|_F \le \frac{C\sqrt{k}\sqrt{a\log a}}{\lambda_k}, \tag{72}$$

with probability at least $1 - n^{-C'}$. Note that by Lemma 8, $V$ satisfies (62). Replacing (61) by (72) and following the remaining proof of Theorem 3, the proof is complete. □
Supplement to “Achieving Optimal Misclassification Proportion in Stochastic Block Model”

By Chao Gao¹, Zongming Ma², Anderson Y. Zhang¹ and Harrison H. Zhou¹
¹Yale University and ²University of Pennsylvania

A A simplified version of Algorithm 1 


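Algorithm 3: A simplified refinement scheme for community detection

Input: Adjacency matrix $A \in \{0,1\}^{n\times n}$,
       number of communities $k$,
       initial community detection method $\sigma^0$.

Output: Community assignment $\hat\sigma$.

Initialization:

1 Apply $\sigma^0$ on $A$ to obtain $\sigma^0(u)$ for all $u \in [n]$;

2 Define $\tilde C_i = \{u : \sigma^0(u) = i\}$ for all $i \in [k]$; let $\tilde{\mathcal E}_i$ be the set of edges within $\tilde C_i$, and $\tilde{\mathcal E}_{ij}$ the set of edges between $\tilde C_i$ and $\tilde C_j$ when $i \ne j$;

3 Define

$$\hat B_{ii} = \frac{|\tilde{\mathcal E}_i|}{\frac{1}{2}|\tilde C_i|(|\tilde C_i| - 1)}, \qquad \hat B_{ij} = \frac{|\tilde{\mathcal E}_{ij}|}{|\tilde C_i||\tilde C_j|}, \quad \forall i \ne j \in [k],$$

and let

$$\hat a = n \min_{i\in[k]} \hat B_{ii} \quad\text{and}\quad \hat b = n \max_{i \ne j \in [k]} \hat B_{ij}.$$

Penalized neighbor voting:

4 For

$$t = \frac{1}{2}\log\frac{\hat a(1 - \hat b/n)}{\hat b(1 - \hat a/n)},$$

define

$$\rho = -\frac{1}{2t}\log\left(\frac{\frac{\hat a}{n}e^{-t} + 1 - \frac{\hat a}{n}}{\frac{\hat b}{n}e^{t} + 1 - \frac{\hat b}{n}}\right);$$

5 For each $u \in [n]$, set

$$\hat\sigma(u) = \operatorname*{argmax}_{l\in[k]} \sum_{\sigma^0(v) = l} A_{uv} - \rho\sum_{v\in[n]} \mathbf{1}_{\{\sigma^0(v) = l\}}.$$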
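The steps of Algorithm 3 can be sketched in a few lines of plain Python. This is an illustrative reading of the algorithm box, not the authors' code: it takes the initial labels `sigma0` as already computed, assumes every initial community has at least two nodes and that the estimates satisfy $0 < \hat b < \hat a < n$, and uses the penalty $\rho = -\frac{1}{2t}\log\frac{(\hat a/n)e^{-t} + 1 - \hat a/n}{(\hat b/n)e^{t} + 1 - \hat b/n}$, which is an assumption about how the damaged display for $\rho$ reads.

```python
import math

def simplified_refinement(A, k, sigma0):
    """One pass of penalized neighbor voting from initial labels sigma0 (0-indexed)."""
    n = len(A)
    members = [[u for u in range(n) if sigma0[u] == i] for i in range(k)]

    def edge_sum(S, T):
        return sum(A[u][v] for u in S for v in T)

    # Within-community densities: edge_sum(C, C) counts each edge twice.
    B_diag = [(edge_sum(C, C) / 2) / (len(C) * (len(C) - 1) / 2) for C in members]
    # Between-community densities.
    B_off = [edge_sum(members[i], members[j]) / (len(members[i]) * len(members[j]))
             for i in range(k) for j in range(k) if i != j]
    p, q = min(B_diag), max(B_off)          # p = a_hat / n, q = b_hat / n
    t = 0.5 * math.log(p * (1 - q) / (q * (1 - p)))
    rho = -math.log((p * math.exp(-t) + 1 - p) / (q * math.exp(t) + 1 - q)) / (2 * t)

    def score(u, l):
        # Neighbors of u in community l, penalized by the community size.
        return sum(A[u][v] for v in members[l]) - rho * len(members[l])

    return [max(range(k), key=lambda l: score(u, l)) for u in range(n)]

# Toy graph: communities {0,1,2,3} and {4,5,6,7}; node 3 starts mislabeled.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (3, 4)]
A = [[0] * 8 for _ in range(8)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
sigma0 = [0, 0, 0, 1, 1, 1, 1, 1]
refined = simplified_refinement(A, 2, sigma0)
```

In the toy example the penalty term keeps the larger initial community from attracting nodes merely because of its size, and the single mislabeled node is moved back to its own community.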


B Proofs of Theorem 5 

Proof of Theorem 5. Let us consider $\Theta = \Theta_0(n, k, a, b, \beta)$; the case of $\Theta(n, k, a, b, \lambda, \beta; \alpha)$ is similar except that the condition (18) is needed to establish the counterpart of Lemma 3. The proof essentially follows the same steps as those in the proof of Theorem 2. First, we note that Lemma 1 continues to hold since it does not need the assumption of $a/b$ being bounded. Thus, the first job is to establish the counterpart of Lemma 2 with $\eta'$ replaced with $C_\epsilon\epsilon_0$. As before, let $p = a/n$ and $q = b/n$.

To this end, we first proceed in the same way to obtain (42) - (52). Without loss of generality, 
let us consider the case where t* > log ^ and t u = log ^ since otherwise we can essentially repeat 
the proof of Theorem 2. Note that this implies | > (^) 2 . In this case, with the new t u in (22), we 
have on the event E u , 


(iqe tu + 1 — q){jpe tu + 1 — p) = e T ' 


where 

I' = - log ((1 -p) (1 - q) + pq + (e l y/(1 - p) (1 - q)pq S j 

(73) 

To see this, we first note that for any x, y E (0,1) and sufficient small constant co > 0, if y > x > 
(1 - c 0 )y and jE§ < 1, then 

- log(l -x) = - log(l - y) - log (1 + j > - log(l - y) - 2^—- > - (l - C' c 0 ) log(l - y), 
where C' y = _ (1 _ i , ) ^ g( i_ 1) ) . When f > (J) 2 and t u = log we have V = - log(l - x) for 

e 2 

x = P + q — 2pq — (e tu_t + e t ~ tu )\/(1 — p) (1 — q)pq >p — 2pq — qe tu — pe~ tu > p(l — eo — -y), 
while I* = — log(l — y) for 

y=p + q-2pq- 2y/(l-p) (1 - q)pq < p + q < p(l + (^) 2 ). 

Thus, for any eo E (0, c £ ), 1 — | > y > x > (1 — 2eo )y and jE^ < 1, and we apply the inequality in 
the third last display to obtain (73). 
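The algebra relating $I'$, $x$, $y$ and $I^*$ can be checked numerically. The sketch below assumes the definition $I^* = -2\log\left(\sqrt{pq} + \sqrt{(1-p)(1-q)}\right)$ from the main text, writes $t^* = \frac{1}{2}\log\frac{p(1-q)}{q(1-p)}$, and reads the displays above as $I' = -\log\left((qe^{t_u} + 1 - q)(pe^{-t_u} + 1 - p)\right) = -\log(1 - x)$; the function names and the values of $p$, $q$ and $t_u$ are illustrative assumptions.

```python
import math

def t_star(p, q):
    return 0.5 * math.log(p * (1 - q) / (q * (1 - p)))

def i_star(p, q):
    """I* = -2 log( sqrt(pq) + sqrt((1-p)(1-q)) )."""
    return -2.0 * math.log(math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q)))

def x_term(p, q, t):
    """x = p + q - 2pq - (e^{t - t*} + e^{t* - t}) sqrt((1-p)(1-q)pq)."""
    s = t - t_star(p, q)
    root = math.sqrt((1 - p) * (1 - q) * p * q)
    return p + q - 2 * p * q - (math.exp(s) + math.exp(-s)) * root

def i_prime(p, q, t):
    """I' = -log( (q e^t + 1 - q)(p e^{-t} + 1 - p) )."""
    return -math.log((q * math.exp(t) + 1 - q) * (p * math.exp(-t) + 1 - p))

p, q = 0.3, 0.05
t_u = 0.7   # a truncated value below t_star(p, q), chosen for illustration
```

At $t_u = t^*$ the quantity $x$ reduces to $y$ and $I'$ equals $I^*$; for any other $t_u$ the factor $e^{t_u - t^*} + e^{t^* - t_u}$ exceeds 2, so $I' < I^*$, which is why the truncated $t_u$ costs a factor close to, but below, one in the exponent.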

Thus, the term in (49) is upper bounded by 

( /-, n 3e 0 x ni + ni \ 

exp (^-(1 - C e —) - I J . 


On the other hand, since \e tu — 1| < 1, \e tu — 1| is bounded and x 1, the term in (48) continues 
to be bounded by 


exp 




Moreover, by the same argument as in Lemma 2, (51) continues to hold. Thus, we can replace (52) 
as 


Pi < exp 


_( 

v 3 ; 2 


35 











and so when k > 3, 


P{? n (u) ^ ir u (a(u))} < (k - 1) exp ^1 - j + Cn (1+<5) 

and when $k = 2$, we can replace $\beta$ by 1 in the last display.

When $k \ge 3$, given the last display and (58), we have

P{o : ('u) / a(u)} = P{£ u (0u(«)) ^ cr(tt)} 

< PjCu^uH) / a(u), Zu = 7r" 1 } +P{4 7^ T u *} 

< P{ct u (^) / 7T u (cr(u))} + P{6 u / vr^) 1 } 

< £*,-<»* + (*_!)exp {-(l-«£)£}. 

Thus, the assumption that ^ lo g fc —> oo and Markov’s inequality leads to 

P jVcqu) > exp|-(l _C 'e e o)^|| 

< pJ Pn( rr n) "> (h. — "If pyti J - 


< 


< 


' 1 1 V ' 0k J J 

|4(cr,u) > (A; - l)exp|-(l - (7^)^!! 

1 1 ” 

-f---^ *(“)} 

_f C e e 0 nl* \ _ Cn~( 1+ ^ _ 

6 flk f + ( fc _ i )exp |_(! _ C' e ^f i )^|" 


If (& - l)exp{-(l - <7^)^} > n-( 1+<5 / 2 \ then 


If (A: — 1) exp j- 


4(^) > ex P { (1 - c < £ »)^}} S ex p{-UT^} 

/ n n 5eo \nl*\ ~ —(l+S/2) 


+ CrT 6 ' 2 




(74) 


(75) 


(76) 


(77) 


P | £o(a,a) > exp |-(1 - < P{4(o‘,3 : ) > 0} < ^P{u(u) ^ a(u)} 

< n(k — 1) exp { — (1 — C'e —) ? wrl + CrT s < Cn~ 5 ^ 2 = o(l). (78) 

l 3 f3k J 

Here, the second last inequality holds since (k— 1) exp j —(1 — Ce^) 2 ^-| < exp | — (1 — C e - g 41 ) 2 ^ j < 

$n^{-(1+\delta/2)}$. We complete the proof for the case of $k \ge 3$ by noting that no constant or sequence in the foregoing arguments involves $B$, $\sigma$ or $u$. When $k = 2$, we run the foregoing arguments with $\beta$ replaced by 1 to obtain the desired claim. □


C Proofs of Theorems 6, 7 and 8 

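Proposition 1. For SBM in the space $\Theta_0(n, k, a, b, \beta)$ satisfying $n \ge 2\beta k$, we have

$$\lambda_k(P) \ge \frac{a-b}{2\beta k}.$$

Proof. Since the eigenvalues of $P$ are invariant with respect to permutation of the community labels, we consider the case where $\sigma(u) = i$ for $u \in \left\{\sum_{j=1}^{i-1} n_j + 1, \dots, \sum_{j=1}^{i} n_j\right\}$ without loss of generality, where $n_j = |\sigma^{-1}(j)|$. Let us use the notation $\mathbf{1}_d \in \mathbb{R}^d$ and $\mathbf{0}_d \in \mathbb{R}^d$ to denote the vectors with all entries being 1 and 0 respectively. Then, it is easy to check that

$$P = \frac{b}{n}\mathbf{1}_n\mathbf{1}_n^T + \frac{a-b}{n}\sum_{i=1}^k v_iv_i^T,$$

where $v_1 = (\mathbf{1}_{n_1}^T, \mathbf{0}_{n_2}^T, \dots, \mathbf{0}_{n_k}^T)^T$, $v_2 = (\mathbf{0}_{n_1}^T, \mathbf{1}_{n_2}^T, \mathbf{0}_{n_3}^T, \dots, \mathbf{0}_{n_k}^T)^T$, ..., $v_k = (\mathbf{0}_{n_1}^T, \dots, \mathbf{0}_{n_{k-1}}^T, \mathbf{1}_{n_k}^T)^T$. Note that $\{v_i\}_{i=1}^k$ are orthogonal to each other, and therefore

$$\lambda_k\left(\sum_{i=1}^k v_iv_i^T\right) \ge \min_{i\in[k]} n_i \ge \frac{n}{\beta k} - 1 \ge \frac{n}{2\beta k}.$$

By Weyl's inequality (Theorem 4.3.1 of [35]),

$$\lambda_k(P) \ge \lambda_k\left(\frac{a-b}{n}\sum_{i=1}^k v_iv_i^T\right) + \lambda_n\left(\frac{b}{n}\mathbf{1}_n\mathbf{1}_n^T\right) \ge \frac{a-b}{2\beta k}.$$

This completes the proof. □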
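In the balanced case the decomposition used in the proof makes the relevant eigenpairs explicit, which gives a quick sanity check of the bound. The sketch below is illustrative: it assumes equal community sizes, a toy choice of $a$ and $b$, and a diagonal of $P$ equal to $a/n$; the helper names are hypothetical. A contrast vector between two communities is then an eigenvector with eigenvalue $(a-b)/k$.

```python
import math

def sbm_mean_matrix(sizes, a, b):
    """P_uv = a/n within a community and b/n across, diagonal included."""
    n = sum(sizes)
    labels = [i for i, m in enumerate(sizes) for _ in range(m)]
    return [[(a if labels[u] == labels[v] else b) / n for v in range(n)]
            for u in range(n)]

def matvec(P, x):
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]

sizes, a, b = [3, 3], 4.0, 1.0
k = len(sizes)
P = sbm_mean_matrix(sizes, a, b)
contrast = [1.0] * 3 + [-1.0] * 3    # v_1 - v_2
ones = [1.0] * 6
lam_k = (a - b) / k                  # eigenvalue of the contrast vector
lam_1 = b + (a - b) / k              # eigenvalue of the all-ones vector
```

Here $\lambda_k(P) = (a-b)/k$, which is consistent with the lower bound $(a-b)/(2\beta k)$ of Proposition 1 with $\beta = 1$.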


Proof of Theorem 6. Let us first consider $\Theta_0(n, k, a, b, \beta)$. By Theorem 3 and Proposition 1, the misclassification proportion is bounded by $C\frac{k^2a}{(a-b)^2}$ under the condition $\frac{k^3a}{(a-b)^2} \le c$ for some small $c$. Thus, Condition 1 holds when $\frac{k^3a\log k}{(a-b)^2} = o(1)$, which leads to the desired conclusion in view of Theorem 2 and Theorem 5. The proof for the space $\Theta(n, k, a, b, \lambda, \beta; \alpha)$ follows the same argument. □

Proof of Theorem 7. The proof is the same as that of Theorem 6. □ 


Proof of Theorem 8. When the parameters $a$ and $b$ are known, we can use $\tau = Ca$ for some sufficiently large $C > 0$ for both USC($\tau$) and NSC($\tau$). Then, the results of Theorem 3 and Theorem 4 hold without assuming $a \le C_1b$ or fixed $k$. Moreover, $\hat a_u$ and $\hat b_u$ in (11) and (22) can be replaced by $a$ and $b$. Then, the conditions (16) and (18) in Theorem 2 and Theorem 5 can be weakened to $\gamma = o(k^{-1})$ because we do not need to establish Lemma 1 anymore. Combining Theorem 2, Theorem 3, Theorem 4 and Theorem 5, we obtain the desired results. □


D Proofs of Lemma 5 and Lemma 7 


The following lemma is Corollary A.1.10 in [5].

Lemma 9. For independent Bernoulli random variables $X_u \sim \text{Bern}(p_u)$ and $p = \frac{1}{n}\sum_{u\in[n]} p_u$, we have

$$\mathbb{P}\left(\sum_{u\in[n]} (X_u - p_u) > t\right) \le \exp\left(t - (pn + t)\log\left(1 + \frac{t}{pn}\right)\right)$$

for any $t \ge 0$.

The following result is Lemma 3.5 in [19].

Lemma 10. Consider any adjacency matrix $A \in \{0,1\}^{n\times n}$ for an undirected graph. Suppose $\max_{u\in[n]} \sum_{v\in[n]} A_{uv} \le \gamma$ and for any $S, T \subset [n]$, one of the following statements holds with some constant $C > 0$:

1. $\frac{e(S,T)}{|S||T|\gamma/n} \le C$,

2. $e(S,T)\log\left(\frac{e(S,T)}{|S||T|\gamma/n}\right) \le C|T|\log\frac{n}{|T|}$,

where $e(S,T)$ is the number of edges connecting $S$ and $T$. Then, $\sum_{(u,v)\in H} x_uA_{uv}y_v \le C'\sqrt{\gamma}$ uniformly over all unit vectors $x, y$, where $H = \{(u,v) : |x_uy_v| \ge \sqrt{\gamma}/n\}$ and $C' > 0$ is some constant.


The following lemma is critical for proving both theorems. 

Lemma 11. For any $\tau > C(1 + np_{\max})$ with some sufficiently large $C > 0$, we have

$$|\{u \in [n] : d_u \ge \tau\}| \le \frac{n}{\tau}$$

with probability at least $1 - e^{-C'n}$ for some constant $C' > 0$.

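Proof. Let us consider any fixed subset of nodes $S \subset [n]$ such that it has degree at least $\tau$ and $|S| = l$ for some $l \in [n]$. Let $e(S)$ be the number of edges in the subgraph $S$ and $e(S, S^c)$ be the number of edges connecting $S$ and $S^c$. By the requirement on $S$, either $e(S) \ge C_1l\tau$ or $e(S, S^c) \ge C_1l\tau$ for some universal constant $C_1 > 0$. We are going to show that both $\mathbb{P}(e(S) \ge C_1l\tau)$ and $\mathbb{P}(e(S, S^c) \ge C_1l\tau)$ are small. Note that $\mathbb{E}e(S) \le C_2l^2p_{\max}$ and $\mathbb{E}e(S, S^c) \le C_2lnp_{\max}$ for some universal $C_2 > 0$. Then, when $\tau > C(np_{\max} + 1)$ for some sufficiently large $C > 0$, Lemma 9 implies

$$\mathbb{P}(e(S) \ge C_1l\tau) \le \exp\left(-\frac{1}{4}C_1l\tau\log\left(1 + \frac{C_1\tau}{2C_2lp_{\max}}\right)\right),$$

and

$$\mathbb{P}(e(S, S^c) \ge C_1l\tau) \le \exp\left(-\frac{1}{4}C_1l\tau\log\left(1 + \frac{C_1\tau}{2C_2np_{\max}}\right)\right).$$

Applying union bound, the probability that the number of nodes with degree at least $\tau$ is greater than $\xi n$ is

$$\begin{aligned}
&\mathbb{P}\left(|\{u \in [n] : d_u \ge \tau\}| > \xi n\right) \\
&\quad\le \sum_{l > \xi n} \mathbb{P}\left(|\{u \in [n] : d_u \ge \tau\}| = l\right) \\
&\quad\le \sum_{l > \xi n} \sum_{|S| = l} \left(\mathbb{P}(e(S) \ge C_1l\tau) + \mathbb{P}(e(S, S^c) \ge C_1l\tau)\right) \\
&\quad\le \sum_{l > \xi n} \exp\left(l\log\frac{en}{l}\right)\left(\exp\left(-\frac{1}{4}C_1l\tau\log\left(1 + \frac{C_1\tau}{2C_2lp_{\max}}\right)\right) + \exp\left(-\frac{1}{4}C_1l\tau\log\left(1 + \frac{C_1\tau}{2C_2np_{\max}}\right)\right)\right) \\
&\quad\le \sum_{l > \xi n} 2\exp\left(l\log\frac{en}{l} - \frac{1}{4}C_1l\tau\log\left(1 + \frac{C_1\tau}{2C_2np_{\max}}\right)\right) \\
&\quad\le \exp(-C'n),
\end{aligned}$$

where the last inequality is by choosing $\xi = \tau^{-1}$. Therefore, with probability at least $1 - e^{-C'n}$, the number of nodes with degree at least $\tau$ is bounded by $\tau^{-1}n$. □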
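The tail bound of Lemma 9, which drives the proof above, can be compared with an exact binomial tail in a small case. The sketch below is only a numerical illustration: it takes all $p_u$ equal so that the sum is binomial, and the function names and parameter values are assumptions.

```python
import math

def lemma9_bound(mean, t):
    """exp(t - (mean + t) log(1 + t / mean)), where mean = p * n = sum of the p_u."""
    return math.exp(t - (mean + t) * math.log(1 + t / mean))

def binomial_upper_tail(n, p, threshold):
    """Exact P(X > threshold) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(n + 1) if j > threshold)

n, p, t = 20, 0.1, 3.0
exact = binomial_upper_tail(n, p, n * p + t)   # P(sum_u (X_u - p_u) > t)
bound = lemma9_bound(n * p, t)
```

For deviations $t$ that are large relative to the mean $pn$, the exponent behaves like $-t\log(t/(pn))$; this logarithmic gain over a sub-Gaussian bound is what lets the union bound over subsets $S$ go through in the proof of Lemma 11.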

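Lemma 12. Given $\tau > 0$, define the subset $J = \{u \in [n] : d_u < \tau\}$. Then for any $C' > 0$, there is some $C > 0$ such that

$$\|A_{JJ} - P_{JJ}\|_{\rm op} \le C\left(\sqrt{np_{\max}} + \sqrt{\tau} + \frac{np_{\max}}{\sqrt{\tau} + \sqrt{np_{\max}}}\right),$$

with probability at least $1 - n^{-C'}$.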

Proof. The idea of the proof follows the argument in [26, 24]. By definition,

$$\|A_{JJ} - P_{JJ}\|_{\rm op} = \sup_{x,y\in S^{n-1}} \sum_{(u,v)\in J\times J} x_u(A_{uv} - P_{uv})y_v.$$

Define $L = \{(u,v) : |x_uy_v| < (\sqrt{\tau} + \sqrt{p_{\max}n})/n\}$ and $H = \{(u,v) : |x_uy_v| \ge (\sqrt{\tau} + \sqrt{p_{\max}n})/n\}$, then we have

$$\|A_{JJ} - P_{JJ}\|_{\rm op} \le \sup_{x,y\in S^{n-1}} \sum_{(u,v)\in L\cap J\times J} x_u(A_{uv} - P_{uv})y_v + \sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_u(A_{uv} - P_{uv})y_v.$$

A discretization argument in [19] implies that

$$\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in L\cap J\times J} x_u(A_{uv} - P_{uv})y_v \lesssim \max_{x,y\in\mathcal{N}} \max_{S\subset[n]} \sum_{(u,v)\in L\cap S\times S} x_u(A_{uv} - \mathbb{E}A_{uv})y_v + \max_{x,y\in\mathcal{N}} \max_{S\subset[n]} \sum_{(u,v)\in L\cap S\times S} x_u(\mathbb{E}A_{uv} - P_{uv})y_v,$$

where $\mathcal{N} \subset S^{n-1}$ and $|\mathcal{N}| \le 5^n$. Then, Bernstein's inequality and union bound imply that $\max_{x,y\in\mathcal{N}} \max_{S\subset[n]} \sum_{(u,v)\in L\cap S\times S} x_u(A_{uv} - \mathbb{E}A_{uv})y_v \le C(\sqrt{\tau} + \sqrt{np_{\max}})$ with probability at least $1 - e^{-C'n}$. We also have $\max_{x,y\in\mathcal{N}} \max_{S\subset[n]} \sum_{(u,v)\in L\cap S\times S} x_u(\mathbb{E}A_{uv} - P_{uv})y_v \le \|\mathbb{E}A - P\|_{\rm op} \le 1$. This completes the first part.

To bound the second part $\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_u(A_{uv} - P_{uv})y_v$, we are going to bound $\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_uA_{uv}y_v$ and $\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_uP_{uv}y_v$ separately. By the definition of $H$,

$$\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_uP_{uv}y_v \le \sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} \frac{x_u^2y_v^2}{|x_uy_v|}P_{uv} \le \frac{np_{\max}}{\sqrt{\tau} + \sqrt{p_{\max}n}}.$$

To bound $\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_uA_{uv}y_v$, it is sufficient to check the conditions of Lemma 10 for the graph $A_{JJ}$. By definition, its degree is bounded by $\tau$. Following the argument of [45], the two conditions of Lemma 10 hold with $\gamma = \tau + np_{\max}$ with probability at least $1 - n^{-C'}$. Thus, $\sup_{x,y\in S^{n-1}} \sum_{(u,v)\in H\cap J\times J} x_uA_{uv}y_v \le C(\sqrt{\tau} + \sqrt{np_{\max}})$ with probability at least $1 - n^{-C'}$. Hence, the proof is complete. □


Proof of Lemma 5. By triangle inequality,

$$\|T_\tau(A) - P\|_{\rm op} \le \|T_\tau(A) - T_\tau(P)\|_{\rm op} + \|T_\tau(P) - P\|_{\rm op},$$

where $T_\tau(P)$ is the matrix obtained by zeroing out the $u$-th row and column of $P$ with $d_u \ge \tau$. Let $J = \{u \in [n] : d_u < \tau\}$, and then $\|T_\tau(A) - T_\tau(P)\|_{\rm op} = \|A_{JJ} - P_{JJ}\|_{\rm op}$, whose bound has been established in Lemma 12. By Lemma 11, $|J^c| \le n/\tau$ with high probability. This implies $\|T_\tau(P) - P\|_{\rm op} \le \|T_\tau(P) - P\|_F \le \sqrt{2n|J^c|p_{\max}^2} \le \frac{\sqrt{2}np_{\max}}{\sqrt{\tau}}$. Taking $\tau \in [C_1(1 + np_{\max}), C_2(1 + np_{\max})]$, the proof is complete. □
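The trimming operator $T_\tau$ used throughout this proof can be written in a few lines. The sketch below is a minimal illustration: the function name is hypothetical, and the convention that a node is removed when $d_u \ge \tau$ (rather than $d_u > \tau$) is an assumption about the boundary case.

```python
def trim(A, tau):
    """Zero out every row and column of A whose degree (row sum) is at least tau."""
    n = len(A)
    keep = [sum(row) < tau for row in A]   # the index set J = {u : d_u < tau}
    return [[A[u][v] if keep[u] and keep[v] else 0 for v in range(n)]
            for u in range(n)]

# Toy graph: node 0 is a hub attached to 1, 2, 3, plus the edge 1 - 2.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
trimmed = trim(A, 3)   # only the hub has degree >= 3
```

Degrees are computed once from the original matrix, so removing a hub does not trigger further removals among its neighbors; this matches the identity $\|T_\tau(A) - T_\tau(P)\|_{\rm op} = \|A_{JJ} - P_{JJ}\|_{\rm op}$ used above.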


Now let us prove Lemma 7. The following lemma, which controls the degree, is Lemma 7.1 in [42].

Lemma 13. For any $C' > 0$, there exists some $C > 0$ such that with probability at least $1 - n^{-C'}$, there exists a subset $J \subset [n]$ satisfying $n - |J| \le \frac{n}{2e(np_{\max} + 1)}$ and

$$|d_v - \mathbb{E}d_v| \le C\sqrt{(np_{\max} + 1)\log(e(np_{\max} + 1))}, \quad\text{for all } v \in J,$$

where $d_v = \sum_{u\in[n]} A_{uv}$.

Using this lemma, together with Lemma 11 and Lemma 12, we are able to prove the following result, which improves the bound in Theorem 7.2 of [42].

Lemma 14. For any $C' > 0$, there exists some $C > 0$ such that with probability at least $1 - n^{-C'}$, there exists a subset $J \subset [n]$ satisfying $n - |J| \le n/d$ and

$$\|(L(A_\tau) - L(P_\tau))_{J\times J}\|_{\rm op} \le C\left(\frac{\sqrt{d\log d}\,(d + \tau)}{\tau^2} + \frac{\sqrt{d}}{\tau}\right),$$

where $d = e(np_{\max} + 1)$.

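Proof. Let us use the notation $d_v = \sum_{u\in[n]} A_{uv}$ in the proof. Define the set $J_1 = \{v \in [n] : d_v \le C_1d\}$ for some sufficiently large constant $C_1 > 0$. Using Lemma 11 and Lemma 12, with probability at least $1 - n^{-C'}$, we have

$$n - |J_1| \le \frac{n}{2d}, \tag{79}$$

and

$$\|(A - P)_{J_1J_1}\|_{\rm op} \le C\sqrt{d}. \tag{80}$$

Let $J_2$ be the subset in Lemma 13, and then with probability at least $1 - n^{-C'}$, $J_2$ satisfies

$$n - |J_2| \le \frac{n}{2d}, \tag{81}$$

and

$$|d_v - \mathbb{E}d_v| \le C\sqrt{d\log d}, \quad\text{for all } v \in J_2. \tag{82}$$

Define $J = J_1 \cap J_2$. By (79) and (81), we have

$$n - |J| = |(J_1 \cap J_2)^c| \le |J_1^c| + |J_2^c| = n - |J_1| + n - |J_2| \le \frac{n}{d}, \tag{83}$$

and

$$\|(A - P)_{JJ}\|_{\rm op} \le \|(A - P)_{J_1J_1}\|_{\rm op} \le C\sqrt{d}. \tag{84}$$

Moreover, (82) implies

$$\max_{v\in J} |d_v - \mathbb{E}d_v| \le C\sqrt{d\log d}.$$

Define $\bar d_v = \sum_{u\in[n]} P_{uv}$. Then,

$$\max_{v\in J} |d_v - \bar d_v| \le \max_{v\in J} |d_v - \mathbb{E}d_v| + 1 \le C\sqrt{d\log d}. \tag{85}$$

Define $D_\tau = {\rm diag}(d_1 + \tau, \dots, d_n + \tau)$ and $\bar D_\tau = {\rm diag}(\bar d_1 + \tau, \dots, \bar d_n + \tau)$. We introduce the notation

$$R = (A_\tau)_{JJ}, \quad B = (D_\tau)_{JJ}^{-1/2}, \quad \bar R = (P_\tau)_{JJ}, \quad \bar B = (\bar D_\tau)_{JJ}^{-1/2}.$$

Using (85), we have

$$\|B - \bar B\|_{\rm op} \le \max_{v\in J} \left|\frac{1}{\sqrt{d_v + \tau}} - \frac{1}{\sqrt{\bar d_v + \tau}}\right| \le C\frac{\sqrt{d\log d}}{\tau^{3/2}}$$

for some constant $C > 0$. The definitions of $B$ and $\bar B$ imply $\|B\|_{\rm op} \vee \|\bar B\|_{\rm op} \le \frac{1}{\sqrt{\tau}}$. We rewrite the bound (84) as $\|R - \bar R\|_{\rm op} \le C\sqrt{d}$. Since all entries of $\mathbb{E}A_\tau$ are bounded by $(\tau + d)/n$, we have $\|\bar R\|_{\rm op} \le \|\mathbb{E}A_\tau\|_{\rm op} \le d + \tau$. Therefore, $\|R\|_{\rm op} \le \|\bar R\|_{\rm op} + \|R - \bar R\|_{\rm op} \le C(d + \tau)$. Finally,

$$\begin{aligned}
\|(L(A_\tau) - L(P_\tau))_{J\times J}\|_{\rm op} &\le \|B - \bar B\|_{\rm op}\|R\|_{\rm op}\|B\|_{\rm op} + \|\bar B\|_{\rm op}\|R - \bar R\|_{\rm op}\|B\|_{\rm op} + \|\bar B\|_{\rm op}\|\bar R\|_{\rm op}\|B - \bar B\|_{\rm op} \\
&\le C\left(\frac{\sqrt{d\log d}\,(d + \tau)}{\tau^2} + \frac{\sqrt{d}}{\tau}\right).
\end{aligned}$$

The proof is complete. □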


Proof of Lemma 7. Recall that $d = e(np_{\max} + 1)$. Following the proof of Theorem 8.4 in [42], it can be shown that with probability at least $1 - n^{-C'}$, for any $J \subset [n]$ such that $n - |J| \le n/d$,


II L(A t ) L(P t )\\o P < \\(L(A t ) L(P T ))jj||op + C[ + 


log d\ 


where the first term on the right side of the inequality above is bounded in Lemma 14 by choosing 
an appropriate J. Hence, with probability at least 1 — 2 n~ c , 

m a) - ur,) nop < c («±il + Lf\ +c (P + . 

Choosing $\tau \in [C_1(1 + np_{\max}), C_2(1 + np_{\max})]$, the proof is complete. □